(54) Integrated retrieval scheme for retrieving semi-structured documents



(57) An integrated retrieval scheme retrieves data
contained in a plurality of semi-structured documents
scattered over open networks and collects the required
information item by item from the semi-structured
documents through a unified interface, without regard to
differences in the document structures, presentation
styles, and elements of the semi-structured documents.

The search scheme receives a query consisting of
search items and search conditions from a user (S200).
According to location data that specifies the location of
each of the semi-structured documents, the search scheme
finds the location of each semi-structured document that
contains all search items (S210). If necessary, it converts
the item presentation styles of the entered query into
those of the located semi-structured documents according
to style conversion data (S220, S225, S230), forms queries
for the located semi-structured documents, transmits the
queries to the found locations, and obtains the located
semi-structured documents (S240). It then extracts item
data from the obtained semi-structured documents
according to structure data, which is used to delimit each
document into items, and attribute data, which is used for
conditional retrieval, and prepares a search result (S240).
Finally, if necessary, it converts the item presentation
styles of the search result into the item presentation
styles of each user according to the style conversion data
(S250).
Description

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

[0001] The present invention relates to a retrieval
technique applied to an open network environment that
involves a plurality of semi-structured documents and
search engines. In particular, the present invention
relates to an integrated retrieval scheme that manages
location data, document structure data, item data,
presentation style data, etc., to provide a unified
interface for retrieving required information item by item
from a plurality of semi-structured documents irrespective
of differences among the locations, document structures,
elements, and input forms of search engines.

2. Description of the Prior Art 

[0002] Increasing performance and decreasing cost of
personal computers, improvements in network technology,
and the growth of inexpensive network providers are
vitalizing open networks, in particular, the Internet.
Many information providers employ HTML (hypertext
markup language), a hypertext description language that
makes content creation easy, to transmit various
information to users through the open networks. The
number of information providers is increasing due to an
explosive increase in information consumers. As a result,
various kinds of information accumulate in the networks,
and it is required to efficiently provide each consumer
with the necessary information from among the
accumulated pieces of information.
[0003] The consumers want to retrieve desired
information collectively from across information sources.
This is hardly possible, however, because information
accumulated in the open networks is mostly held in HTML
documents that have mutually different structures,
presentation styles, or search formats, which hinders
retrieving desired information from across different
information sources.
[0004] Information retrieval apparatuses called search
engines are widely used to retrieve HTML documents
scattered over the network. Here, "search engine" is a
generic term for a system that retrieves certain
information through an input form. Figure 1 shows an
information retrieval technique according to a prior art
using a URL search engine. A URL search engine is a
search engine that returns URLs as a search result in
response to a query containing keywords or conditional
terms. For example, a user has an interest in "a PC of
100,000 yen or below." The user enters keywords into a URL
search engine. Figure 2 shows an example of a URL search
engine according to a prior art. The URL search engine
900 has a keyword index 910 that contains keywords and
the locations, i.e., URLs, of related HTML documents
spread over networks. The keyword index 910 is registered
in advance. A search processor 930 searches the keyword
index 910 for the keywords entered by the user and
returns a list of URLs and outlines. Each URL indicates
the location of an HTML document that contains the
entered keywords or their synonyms. Returning to Fig. 1,
the user accesses the returned HTML documents one by one
to find the necessary information. In this way, when
obtaining information from HTML documents whose
locations are unknown, the users first have to find the
locations of HTML documents that may contain the
necessary information by a wide document search, and
then inspect each of the HTML documents in the obtained
URL list for the necessary information. The users must
therefore spend much time and labor before they get the
necessary information. In addition, the prior arts are
incapable of collectively retrieving information from
across a plurality of HTML documents.
[0005] The prior arts may find the locations of HTML
documents that contain given keywords and their
synonyms but are unable to collect information item by
item by collectively retrieving information contained in
HTML documents. The prior arts are unable to set
conditions on search results. For example, they are
unable to filter search results by date. Moreover, when
using URL search engines that each provide their own
input form as a search interface, users must take into
account the individual input form of each URL search
engine and access the URL search engines one by one.
[0006] More particularly, HTML documents employed by
on-line shops in electronic commerce frequently show
product information such as names and prices as a list in
table or clause style, in which each entry forms one
meaningful cluster of data. There are demands to retrieve
information collectively from among these HTML
documents of on-line shops. For example, a user may want
to retrieve information about shops that offer the lowest
price for a specific product. In this case, the user enters
the name, maker, category, etc., of the product as
keywords. Then, the prior art of Fig. 1 provides the user
with the locations of HTML documents related to the
keywords. The user accesses the HTML documents one by
one to check whether they offer the product under
preferable conditions. The prior art of Fig. 1, however,
searches the full text of each HTML document for the
entered keywords without considering the elements that
form the HTML document, and therefore tends to retrieve
a lot of data that is irrelevant to the user. Accordingly,
the user must spend much time and labor to find the
necessary information from among the HTML documents
retrieved by the prior art.
[0007] The prior arts are incapable of retrieving
information from a given HTML document item by item. For
example, they are unable to extract the price, image,
maker, etc., of a given product from a given HTML
document containing a product information table. The
prior arts are unable to extract the name, phone number,
address, etc., of each shop from a given HTML document
containing shop information in clause style. The prior
arts are unable to set conditions such as a date to filter
results retrieved from HTML documents.
[0008] There is a conventional technique that creates a
hypothetical database by mapping the internal structure
of each document and relationships between documents
into unique models, to extract itemized pieces of
information. This technique was disclosed by N. Ashish
and C. A. Knoblock in "Semi-automatic wrapper
generation for Internet information sources," Proceedings
of Cooperative Information Systems, 1997. This technique
regards a portion of an HTML document as meaningful
information when the portion is marked by specific tags,
such as the TITLE tag, or by specific presentation
attributes such as size, color, and typestyle (e.g., bold
and italic), and extracts such information automatically.
This technique covers the case in which a minimum
cluster of certain information is described in one HTML
document and a plurality of such HTML documents are
described in the same format. This technique is effective,
for example, when weather information for each region is
described in a different HTML document. However, this
technique does not take into account the case in which
information is described as a list, such as a table or
clauses, in one HTML document. Accordingly, this
technique cannot be applied to such a case.
[0009] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha,
and A. Crespo disclosed another technique in "Extracting
semistructured information from the web," Workshop on
Management of Semistructured Data, 1997. This technique
creates a hypothetical database by employing a unique
OEM data model and manages relationships between the
database and various information sources, thereby
retrieving information from heterogeneous web sources in
an integrated manner. To manage these relationships, this
technique employs a template file that depends on the
HTML tag description rules of each HTML document.
However, in this technique, a modification in an HTML
document affects the hypothetical database, and a
modification in the hypothetical database affects the
application. Accordingly, this technique needs much labor
for management and maintenance of the system.
[0010] There are no standards for the HTML descriptions
used to provide information such as the products handled
by on-line shops. Namely, on-line shops use their own
individual HTML documents. This is explained below.
[0011] HTML documents prepared by on-line shops
have different document structures. For example, a
shop A employs a TABLE tag to describe products in
table format, while a shop B employs a UL tag to itemize
products in clause format.
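
As an illustration only, the following minimal Python sketch shows how the same kind of product list can be delimited into items when one shop uses a TABLE tag and another uses a UL tag. The class name `ItemTextParser`, the function `extract_items`, and the sample markup `SHOP_A` and `SHOP_B` are hypothetical and are not taken from the disclosure; the only per-document knowledge assumed is which tag delimits one item.

```python
from html.parser import HTMLParser


class ItemTextParser(HTMLParser):
    """Collect the text of every element named `item_tag` (e.g. 'tr' or 'li')."""

    def __init__(self, item_tag):
        super().__init__()
        self.item_tag = item_tag
        self.items = []
        self._parts = None  # text fragments of the item currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == self.item_tag:
            self._parts = []

    def handle_endtag(self, tag):
        if tag == self.item_tag and self._parts is not None:
            self.items.append(" ".join(self._parts))
            self._parts = None

    def handle_data(self, data):
        text = data.strip()
        if self._parts is not None and text:
            self._parts.append(text)


def extract_items(html, item_tag):
    """Return one string per item delimited by `item_tag`."""
    parser = ItemTextParser(item_tag)
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical markup: shop A uses a table, shop B uses an itemized list.
SHOP_A = ("<table><tr><td>PC-100</td><td>98,000 yen</td></tr>"
          "<tr><td>PC-200</td><td>120,000 yen</td></tr></table>")
SHOP_B = ("<ul><li>PC-100: 98 thousand yen</li>"
          "<li>PC-200: 120 thousand yen</li></ul>")

print(extract_items(SHOP_A, "tr"))
print(extract_items(SHOP_B, "li"))
```

The same extraction routine serves both shops once the delimiting tag is recorded per document, which is the role the description assigns to document structure data.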

[0012] The HTML documents of on-line shops employ
different presentation styles even for the same product.
For example, yen, thousand yen, ten-thousand yen,
dollars, etc., are used as price units depending on the
shop. Some shops use double-byte characters to express
prices and others employ single-byte characters for the
same purpose.
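
The effect of such differences can be illustrated by a short Python sketch that reduces prices written in different units and character widths to a single value in yen. The table `UNIT_TO_YEN` and the function `price_in_yen` are hypothetical names introduced here for illustration; dollars are deliberately left out because converting them would require an exchange rate that the description does not give.

```python
import re
import unicodedata
from decimal import Decimal

# Hypothetical style conversion table: unit label -> multiplier into yen.
UNIT_TO_YEN = {
    "yen": 1,
    "thousand yen": 1_000,
    "ten-thousand yen": 10_000,
}


def price_in_yen(text):
    """Convert a price string such as '98 thousand yen' into an integer yen amount.

    NFKC normalization folds double-byte (full-width) digits and commas
    into their single-byte equivalents before parsing.
    """
    normalized = unicodedata.normalize("NFKC", text).replace(",", "").strip().lower()
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(.+)", normalized)
    if match is None:
        raise ValueError(f"unrecognized price: {text!r}")
    amount, unit = match.groups()
    if unit not in UNIT_TO_YEN:
        raise ValueError(f"unknown unit: {unit!r}")
    return int(Decimal(amount) * UNIT_TO_YEN[unit])


print(price_in_yen("98,000 yen"))
print(price_in_yen("98 thousand yen"))
print(price_in_yen("９８，０００ yen"))  # double-byte digits and comma
```

Once every shop's price is expressed in the same unit, a condition such as "100,000 yen or below" can be evaluated uniformly across documents.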

[0013] The HTML documents of on-line shops have
different data elements even for the same product. For
example, a product is represented by only its name, or by
its name and model number, or by its maker, name, and
model number, depending on the shop. To get necessary
information from HTML documents gathered by the
conventional retrieval techniques, users must extract
pieces of information from the documents and compare
them with one another. It takes a long time and much
labor to retrieve necessary data from them.

[0014] In addition, the search engines used to search
open networks for required information differ from one
another in the types of information they handle, in
performance, and in fees, and therefore the users must
choose among them depending on the situation. In other
words, to use plural search engines, the users must know
the location and interface of each search engine
individually.

[0015] First, it is difficult to find and manage the
locations of search engines. The users must individually
manage the locations of search engines with the use of,
for example, bookmarks. This is hard to achieve in an
environment in which the users work on terminals other
than their own, such as a mobile environment.

[0016] Second, the search interfaces that search
engines provide as input forms are not unified. Many
search engines employ their own input forms, whose
structures are not unified. Accordingly, the users must
learn separate systems, operation sequences, and
schemes when handling different search engines. It is
hard for the users to know which search engine is
effective for a certain search item. It is also hard for the
users to conditionally process information contained in
retrieved HTML documents.

[0017] Third, searching for information through search
engines is inefficient. The users must handle several
search engines until they get the required information,
which involves many search operations.
[0018] Fourth, the search engines present search
results in different item presentation styles, character
codes, etc., and it is difficult for the users to compare the
search results with one another.

[0019] To solve the heterogeneity among the search
engines, Jumon World Seek at
http://member.nifty.ne.jp/jumon has disclosed a technique
of preparing a common search interface for URL search
engines, which are one kind of search engine, managing
relationships between the common search interface and
the individual interfaces of the URL search engines,
converting a search request for the common search
interface into search requests for the search engines, and
executing the search requests for the search engines.
This technique provides the common search interface
employing a single text box to handle the URL search
engines. In practice, there are not only URL search
engines but also various other search engines. To use
such a variety of search engines, this technique has the
following problems:

(1) Necessity of considering a plurality of input items 

[0020] Some search engines employ the simplest input
form, with a single text box for entering the keywords to
search for. To narrow the information to retrieve, other
search engines allow the users to enter search conditions
such as an area and an industry field in addition to
keywords. However, the technique mentioned above is
incapable of achieving such a narrowing search operation
because it does not support a plurality of input items.

(2) Necessity of coping with a variety of input forms 

[0021] To allow search conditions to be entered
properly, some search engines employ several kinds of
input form objects, such as text boxes for text input, radio
buttons for selecting one among several items, and select
boxes or check boxes for selecting some among several
items. The technique mentioned above is incapable of
coping with input objects other than the text box because
it supports only a single text box.

(3) Reconstruction of application 

[0022] When adding, correcting, or deleting search
engines with respect to the common search interface,
the technique mentioned above must correct the
common search interface and reconstruct the
corresponding applications.

[0023] In this way, the conventional technique
mentioned above is incapable of coping with a variety of
search engines and needs a lot of time and labor to
design, maintain, and manage.

SUMMARY OF THE INVENTION 

[0024] An object of the present invention is to provide
an integrated retrieval scheme capable of retrieving
required information from a plurality of semi-structured
documents, such as HTML documents, that are scattered
over open networks and have different document
structures, presentation styles, and information
elements, converting the retrieved information into a
unified form for each user, and returning the information
in the unified form to the user.

[0025] Another object of the present invention is to
provide an integrated retrieval scheme capable of
individually managing the input form objects of each
search engine serving open networks to resolve
differences among the search engines, generating search
requests specific to the search engines according to a
user's search request, and executing search operations
with respect to the search engines in an open network
environment including many search engines.
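
One way to picture the individual management of input form objects is the Python sketch below, which turns a common query into a request for one engine's form. It is a sketch under stated assumptions, not the disclosed implementation: the record `ENGINE_FORM`, its field names, the host `search.example`, and the function `build_request_url` are all hypothetical, and the form is assumed to be submitted with the GET method.

```python
from urllib.parse import urlencode

# Hypothetical description of one search engine's input form.
ENGINE_FORM = {
    "action": "http://search.example/find",
    "fields": {
        "keyword": {"name": "q", "kind": "text"},
        "area": {"name": "area", "kind": "radio",
                 "options": {"Tokyo", "Osaka"}},
        "category": {"name": "cat", "kind": "checkbox",
                     "options": {"pc", "printer"}},
    },
}


def build_request_url(form, query):
    """Map a common query onto the engine's own input objects."""
    params = []
    for field, value in query.items():
        spec = form["fields"].get(field)
        if spec is None:
            continue  # this engine has no input object for the field
        values = list(value) if isinstance(value, (list, tuple)) else [value]
        if spec["kind"] != "text":
            unknown = [v for v in values if v not in spec["options"]]
            if unknown:
                raise ValueError(f"{field}: unsupported option(s) {unknown}")
        if spec["kind"] in ("text", "radio") and len(values) != 1:
            raise ValueError(f"{field}: exactly one value is allowed")
        params.extend((spec["name"], v) for v in values)
    return form["action"] + "?" + urlencode(params)


print(build_request_url(
    ENGINE_FORM,
    {"keyword": "PC", "area": "Tokyo", "category": ["pc", "printer"]},
))
```

Because the form description is plain data, adding or changing a search engine under this sketch would mean editing a record rather than rebuilding the application.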
[0026] Still another object of the present invention is
to provide an integrated retrieval scheme capable of
managing the location, document structure, and item
attributes of each HTML document and extracting
required information item by item from arbitrary HTML
documents that differ in location, document structure,
and attributes.

[0027] In order to accomplish the objects, an aspect of
the present invention provides an apparatus for
retrieving data contained in a plurality of semi-structured
documents over open networks, comprising: a unit for
storing meta data for each of the semi-structured
documents, the meta data including items to be extracted
from the semi-structured documents and item data used
to conditionally retrieve the items; a unit for retrieving
data scattered among the semi-structured documents for
an entered query according to the meta data, and
preparing a collective search result; and a unit for
outputting the search result in a prescribed single format
that is specific to each user.
[0028] Another aspect of the present invention
provides an apparatus for retrieving data contained in a
plurality of semi-structured documents over open
networks, comprising: (a) a unit for storing location data
about the location of each of the semi-structured
documents, document structure data about the structure
of each of the semi-structured documents, used to delimit
each document into items to be extracted, attribute data
about the attributes of each of the items to be extracted,
used to conditionally retrieve the items, and style
conversion data used to convert between item
presentation styles of the user and item presentation
styles of the semi-structured documents; (b) a unit for
finding, according to the location data, the location of a
semi-structured document that contains all search items
specified in an entered query that consists of the search
items and search conditions; (c) a unit for converting, if
necessary, item presentation styles of the entered query
into item presentation styles of the search items in the
located semi-structured documents according to the style
conversion data, and forming queries for the located
semi-structured documents; (d) a unit for transmitting the
queries provided by the unit (c) to the found locations and
acquiring the semi-structured documents; (e) a unit for
extracting item data from the acquired semi-structured
documents according to the document structure data,
selecting the extracted item data, if necessary, according
to the attribute data for the search conditions, and
preparing a search result; and (f) a unit for converting, if
necessary, item presentation styles of the search result
into the item presentation styles of each user according
to the style conversion data.
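
The flow through units (b), (d), and (e) can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the claimed apparatus: `DocumentMeta`, `find_locations`, and `retrieve` are hypothetical names, document acquisition and item extraction are passed in as the callables `fetch` and `extract`, and the style conversion of units (c) and (f) is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentMeta:
    """Hypothetical per-document record: location data plus offered items."""
    url: str
    items: tuple


def find_locations(metas, search_items):
    """Unit (b): keep documents whose items cover every search item."""
    wanted = set(search_items)
    return [meta for meta in metas if wanted <= set(meta.items)]


def retrieve(metas, search_items, condition, fetch, extract):
    """Units (d) and (e): acquire each located document, extract its item
    records, keep the records that satisfy the search condition, and
    project them onto the requested search items."""
    results = []
    for meta in find_locations(metas, search_items):
        document = fetch(meta.url)                 # unit (d): acquisition
        for record in extract(meta, document):     # unit (e): extraction
            if condition(record):                  # unit (e): selection
                results.append({item: record[item] for item in search_items})
    return results
```

A caller would supply `fetch` (for example, an HTTP client) and `extract` (a routine driven by the document structure data); a document lacking any requested item is never fetched, which mirrors the "contains all search items" test of unit (b).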

[0029] Still another aspect of the present invention
provides an apparatus for retrieving data through search
engines over open networks, comprising: (aa) a unit for
storing location data about the location of each search
engine, essential input item data specifying essential
input items required by an input form of each search
engine, document structure data about the structure of
each HTML document, used to delimit each document into
items to be extracted, attribute data about the attributes
of the items to be extracted, used to conditionally retrieve
the items, and style conversion data used to convert
between item presentation styles of a user and item
presentation styles of each HTML document; (bb) a unit
for finding, according to the location data, the location of
a search engine that contains all search items specified in
an entered query that consists of the search items and
search conditions; (cc) a unit for selecting, according to
the essential input item data, search engines to be
searched from among the located search engines, the
selected search engines being those whose essential
input items satisfy the specified search conditions; (dd) a
unit for determining an optimum retrieval pattern for each
of the selected search engines according to a matrix table
and converting the entered query into queries for the
selected search engines accordingly, the matrix table
defining combinations of the search items and search
conditions with the items and essential input items of
each search engine; (ee) a unit for converting, if
necessary, item presentation styles of the queries
provided by the unit (dd) into item presentation styles of
the search items in the selected search engines according
to the style conversion data; (ff) a unit for transmitting the
queries provided by the unit (ee) to the found locations
and acquiring HTML documents; (gg) a unit for extracting
item data from the acquired HTML documents, serving as
a first search result, according to the document structure
data, selecting the extracted item data, if necessary,
according to the attribute data for the search conditions
on the basis of the corresponding retrieval pattern, and
preparing a second search result; and (hh) a unit for
converting, if necessary, item presentation styles of the
second search result into item presentation styles of each
user according to the style conversion data.
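
The selection performed by units (bb) and (cc) can be pictured with the Python sketch below. It rests on an interpretation that should be read as an assumption: an engine is taken to qualify when it offers every search item and when the entered search conditions supply a value for each of its essential input items. `EngineMeta` and `select_engines` are hypothetical names, and the matrix table of unit (dd) is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineMeta:
    """Hypothetical per-engine record."""
    url: str                      # location data
    output_items: frozenset       # items the engine's result pages contain
    essential_inputs: frozenset   # inputs its form cannot be submitted without


def select_engines(engines, search_items, conditions):
    """Return engines that offer all search items (bb) and whose essential
    input items are all supplied by the search conditions (cc)."""
    wanted = set(search_items)
    supplied = set(conditions)
    return [
        engine for engine in engines
        if wanted <= engine.output_items
        and engine.essential_inputs <= supplied
    ]
```

Under this reading, an engine that demands an "area" input is skipped when the user gave only a keyword, so no request is sent that the engine's form would reject.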
[0030] Still another aspect of the present invention
provides an apparatus for extracting data item by item
from arbitrary HTML documents over open networks,
comprising: (aaa) a unit for storing a template for each
HTML document according to document structure data
about the structure of the HTML document, used to
delimit the document into items to be extracted, the
template stipulating at least the names of the items to be
extracted and prescribed text extraction style data of the
item group to be extracted from the HTML document;
(bbb) a unit for analyzing the template corresponding to
an acquired HTML document; and (ccc) a unit for
comparing the acquired HTML document with the
corresponding template by scanning the acquired HTML
document, and extracting item data of the items matching
the text extraction style data of the template, so as to
prepare a search result.
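
A template of this kind can be approximated, purely for illustration, by a record holding the item names and a pattern that plays the part of the text extraction style data. In the Python sketch below, the dictionary `TEMPLATE`, its keys, and the function `extract_with_template` are hypothetical, and a regular expression stands in for whatever template syntax the embodiment actually uses.

```python
import re

# Hypothetical template for a shop that lists products as "<li>name: price</li>".
TEMPLATE = {
    "items": ("name", "price"),
    "pattern": r"<li>(?P<name>[^:<]+):\s*(?P<price>[^<]+)</li>",
}


def extract_with_template(template, html):
    """Scan the document and return one record per match of the template."""
    pattern = re.compile(template["pattern"], re.IGNORECASE)
    return [
        {item: match.group(item).strip() for item in template["items"]}
        for match in pattern.finditer(html)
    ]


page = "<ul><li>PC-100: 98,000 yen</li><li>PC-200: 120,000 yen</li></ul>"
for record in extract_with_template(TEMPLATE, page):
    print(record)
```

Keeping the pattern in data rather than in code means that a change in one shop's markup would, in this sketch, require editing only that shop's template.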
[0031] Still another aspect of the present invention
provides a method of retrieving data contained in a
plurality of semi-structured documents over open
networks, comprising the steps of: retrieving data
scattered among semi-structured documents for an
entered query according to meta data about each of the
semi-structured documents and preparing a collective
search result, the meta data including items to be
extracted from the semi-structured documents and item
data used to conditionally retrieve the items; and
outputting the search result in a prescribed single format
that is specific to each user.

[0032] Still another aspect of the present invention
provides a method of retrieving data contained in a
plurality of semi-structured documents over open
networks, comprising the steps of: (a) finding, according
to location data that specifies the location of each of the
semi-structured documents, the location of a
semi-structured document that contains all search items
specified in an entered query that consists of the search
items and search conditions; (b) converting, if necessary,
item presentation styles of the entered query into item
presentation styles of the search items in the located
semi-structured documents according to style conversion
data and forming queries for the located semi-structured
documents, the style conversion data being used to
convert between item presentation styles of a user and
item presentation styles of the semi-structured
documents; (c) transmitting the queries provided by the
step (b) to the found locations and acquiring the
semi-structured documents; (d) extracting item data from
the acquired semi-structured documents according to
document structure data, selecting the extracted item
data, if necessary, according to attribute data for the
search conditions, and preparing a search result, the
document structure data specifying the structure of each
of the semi-structured documents and being used to
delimit each document into items to be extracted, the
attribute data specifying the attributes of each item to be
extracted and being used to conditionally retrieve the
items; and (e) converting, if necessary, item presentation
styles of the search result into the item presentation
styles of each user according to the style conversion data.

[0033] Still another aspect of the present invention
provides a method of retrieving data through search
engines over open networks, comprising the steps of:
(aa) finding, according to location data that specifies the
location of each search engine, the location of a search
engine that contains all search items specified in an
entered query that consists of the search items and
search conditions; (bb) selecting, according to essential
input item data that specifies essential input items
required by an input form of each search engine, search
engines to be searched from among the located search
engines, the selected search engines being those whose
essential input items satisfy the specified search
conditions; (cc) determining an optimum retrieval pattern
for each of the selected search engines according to a
matrix table and converting the entered query into
queries for the selected search engines accordingly, the
matrix table defining combinations of the search items
and search conditions with the items and essential input
items of each search engine; (dd) converting, if necessary,
item presentation styles of the queries provided by the
step (cc) into item presentation styles of the search items
in the selected search engines according to style
conversion data that is used to convert between item
presentation styles of a user and item presentation styles
of each HTML document; (ee) transmitting the queries
obtained by the step (dd) to the found locations and
acquiring HTML documents; (ff) extracting item data from
the acquired HTML documents, serving as a first search
result, according to document structure data, selecting, if
necessary, the extracted item data according to attribute
data for the search conditions on the basis of the
corresponding retrieval pattern, and preparing a second
search result, the document structure data specifying the
structure of each HTML document and being used to
delimit each document into items to be extracted, the
attribute data specifying the attributes of the items to be
extracted and being used to conditionally retrieve the
items; and (gg) converting, if necessary, item presentation
styles of the second search result into item presentation
styles of each user according to the style conversion data.

[0034] Still another aspect of the present invention
provides a method of extracting data item by item from
arbitrary HTML documents over open networks,
comprising the steps of: (aaa) analyzing a template
corresponding to an acquired HTML document, the
template for each HTML document being set according to
document structure data that specifies the structure of
each HTML document and is used to delimit the
document into items to be extracted, the template
stipulating at least the names of the items to be extracted
and prescribed text extraction style data of the item
group to be extracted from the corresponding HTML
document; and (bbb) comparing the acquired HTML
document with the corresponding template by scanning
the acquired HTML document, and extracting item data of
the items matching the text extraction style data of the
template, so as to prepare a search result.

[0035] Still another aspect of the present invention
provides a computer readable recording medium
recording a program for causing a computer to execute
processing for retrieving data contained in a plurality of
semi-structured documents over open networks, the
processing including: a process for retrieving the data
scattered among semi-structured documents for an
entered query according to meta data about each of the
semi-structured documents and preparing a collective
search result, the meta data including items to be
extracted from the semi-structured documents and item
data used to conditionally retrieve the items; and a
process for outputting the search result in a prescribed
single format that is specific to each user.
[0036] Still another aspect of the present invention
provides a computer readable recording medium
recording a program for causing a computer to execute
processing for retrieving data contained in a plurality of
semi-structured documents over open networks, the
processing including: (a) a process for finding, according
to location data that specifies the location of each of the
semi-structured documents, the location of a
semi-structured document that contains all search items
specified in an entered query that consists of the search
items and search conditions; (b) a process for converting,
if necessary, item presentation styles of the entered
query into item presentation styles of the search items in
the located semi-structured documents according to style
conversion data and forming queries for the located
semi-structured documents, the style conversion data
being used to convert between item presentation styles of
a user and item presentation styles of the semi-structured
documents; (c) a process for transmitting the queries
provided by the process (b) to the found locations and
acquiring the semi-structured documents; (d) a process
for extracting item data from the acquired
semi-structured documents according to document
structure data, selecting the extracted item data, if
necessary, according to attribute data for the search
conditions, and preparing a search result, the document
structure data specifying the structure of each of the
semi-structured documents and being used to delimit
each document into items to be extracted, the attribute
data specifying the attributes of each item to be extracted
and being used to conditionally retrieve the items; and (e)
a process for converting, if necessary, item presentation
styles of the search result into the item presentation
styles of each user according to the style conversion data.

[0037] Still another aspect of the present invention
provides a computer readable recording medium
recording a program for causing a computer to execute
processing for retrieving data through search engines
over open networks, the processing including: (aa) a
process for finding, according to location data that
specifies the location of each search engine, the location
of a search engine that contains all search items specified
in an entered query that consists of the search items and
search conditions; (bb) a process for selecting, according
to essential input item data that specifies essential input
items required by an input form of each search engine,
search engines to be searched from among the located
search engines, the selected search engines being those
whose essential input items satisfy the specified search
conditions; (cc) a process for determining an optimum
retrieval pattern for each of the selected search engines
according to a matrix table and converting the entered
query into queries for the selected search engines
accordingly, the matrix table defining combinations of the
search items and search conditions with the items and
essential input items of each search engine; (dd) a
process for converting, if necessary, item presentation
styles of the queries provided by the process (cc) into
item presentation styles of the search items in the
selected search engines according to style conversion
data that is used to convert between item presentation
styles of a user and item presentation styles of each
HTML document; (ee) a process for transmitting the
queries obtained by the process (dd) to the found
locations and acquiring HTML documents; (ff) a process
for extracting item data from the acquired HTML
documents, serving as a first search result, according to
document structure data, selecting, if necessary, the
extracted item data according to attribute data for the
search conditions on the basis of the corresponding
retrieval pattern, and preparing a second search result,
the document structure data specifying the structure of
each HTML document and being used to delimit each
document into items to be extracted, the attribute data
specifying the attributes of the items to be extracted and
being used to conditionally retrieve the items; and (gg) a
process for converting, if necessary, item presentation
styles of the second search result into item presentation
styles of each user according to the style conversion data.
[0038] Still another aspect of the present invention
provides a computer readable recording medium
recording a program for causing the computer to execute
processing for extracting data item by item from
arbitrary HTML documents over open networks, the
processing including: (aaa) a process for analyzing a
template corresponding to an acquired HTML document,
the template for each HTML document being set
according to document structure data that specifies the
structure of each HTML document and is used to delimit
the document into items to be extracted, the template
stipulating at least the item names to be extracted and
prescribed text extraction style data of the item group to be
extracted from the corresponding HTML document; and
(bbb) a process for comparing the acquired HTML document
with the corresponding template by scanning
the acquired HTML document, and extracting item data
of the items matching the text extraction style data of the
template, so as to prepare a search result.
[0039] Other and further objects and features of the
present invention will become apparent from the following
description taken in conjunction with the accompanying
drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

[0040]

Figure 1 shows a sequence of processes for
searching HTML documents for required information
according to a prior art;
Fig. 2 shows the principle of a conventional search
technique;
Fig. 3 shows a sequence of processes for searching
HTML documents for required information
according to an integrated retrieval technique of the
present invention;
Fig. 4 shows the principle of the integrated retrieval
of the present invention;
Fig. 5 shows a HTML document integrated retrieval
apparatus according to a first embodiment of the
present invention;
Fig. 6 shows the structure of a HTML document
meta data storing unit arranged in the apparatus of
Fig. 5;
Fig. 7 is a flow chart showing a preparatory phase
of the first embodiment;
Fig. 8 is a flow chart showing an execution phase of
the first embodiment;
Figs. 9A and 9B show the exemplary display and
HTML description of an HTML document;
Figs. 10A and 10B show the display and HTML
description of another HTML document;
Fig. 11 shows an example of an HTML document
table stored in the storing unit of Fig. 6;
Fig. 12 shows an example of a HTML document to
table mapping table stored in the storing unit of Fig. 6;
Fig. 13 shows an example of a HTML document
item table stored in the storing unit of Fig. 6;
Fig. 14 shows an example of a domain table stored
in the storing unit of Fig. 6;
Fig. 15 shows an example of a user domain table
stored in the storing unit of Fig. 6;
Fig. 16 shows an example of a domain conversion
function table stored in the storing unit of Fig. 6;
Fig. 17 shows an Internet information integrated
retrieval according to a second embodiment of the
present invention;
Fig. 18 shows a HTML document meta data storing
unit according to the second embodiment arranged
in the apparatus of Fig. 17;
Figs. 19A, 19B, and 19C show examples of input
forms of search engines according to the second
embodiment;
Fig. 20 shows an HTML description corresponding
to the input form of Fig. 19B;
Fig. 21 is a flow chart showing a preparatory phase
of the second embodiment;
Fig. 22 shows an example of a HTML document
item table stored in the storing unit of Fig. 18;
Fig. 23 shows an example of a HTML document
table stored in the storing unit of Fig. 18;
Fig. 24 shows an example of a HTML document to
table mapping table stored in the storing unit of Fig. 18;
Fig. 25 shows an example of a domain table stored
in the storing unit of Fig. 18;
Fig. 26 shows an example of a domain conversion
function table stored in the storing unit of Fig. 18;
Fig. 27 shows an example of a user domain table
stored in the storing unit of Fig. 18;
Fig. 28 shows an example of an essential item table
stored in the storing unit of Fig. 18;
Fig. 29 shows simplified relationships between the
apparatus of the second embodiment and search
engines in processing of search request;
Fig. 30 shows a search pattern matrix table according
to the second embodiment;
Fig. 31 is a flow chart showing an execution phase
of the second embodiment;
Fig. 32 shows a location for data items in step S410
of Fig. 31;
Figs. 33 to 35 show retrieval pattern for pages A to
C prepared in step S440 of Fig. 31;
Fig. 36 shows relationships between user input
domains and local domains prepared in step S450
of Fig. 31;
Figs. 37A and 37B show the exemplary display and
HTML description of a search result from page B;
Fig. 38 shows relationships between local domains
and user output domains prepared in step S500 of
Fig. 31;
Fig. 39 shows a HTML document information
extraction apparatus according to a third embodiment
of the present invention;
Fig. 40 is a flow chart showing a preparatory phase
of the third embodiment;
Fig. 41 shows an example of a proxy setting file;
Fig. 42 shows an example of a template file;
Fig. 43 shows an example of a URL-template table;
Fig. 44 is a flow chart showing an execution phase
of the third embodiment;
Fig. 45 shows a display of an HTML document on a
Web browser;
Fig. 46 shows a part of HTML description corresponding
to the display of Fig. 45;
Fig. 47 shows a template file for extracting item
data from the HTML document of Fig. 45, Fig. 46;
Fig. 48 shows an example of extraction made from
the HTML document of Fig. 45 according to the
template file of Fig. 47;
Fig. 49 shows a display of an HTML document on a
Web browser according to a modification of the
third embodiment;
Fig. 50 shows a display of an HTML document
linked to the HTML document of Fig. 49 having a
same structure as the HTML document of Fig. 49
on a Web browser;
Fig. 51 shows an HTML description corresponding
to the display of Fig. 49; and
Fig. 52 shows an HTML description corresponding
to the display of Fig. 50.

DETAILED DESCRIPTION OF THE EMBODIMENTS 

[0041] Various embodiments of the present invention
will be described in detail with reference to the accompanying
drawings. In this specification, the semi-structured
documents include documents or other materials
described in HTML (hypertext markup language),
SGML (standard generalized markup language), XML
(extensible markup language), etc. The explanation of
the embodiments is based on HTML documents unless
specifically mentioned otherwise. Note that the following
embodiments can be applied to SGML documents and
XML documents with appropriate modification. An input
form provided by a search engine for information retrieval
consists of an HTML document. Therefore, in the following
explanation, the HTML documents include these input forms
furnished for search engines. The present invention
is widely applicable to applications that utilize plural
HTML documents that differ mutually in various aspects
and are connected together through open networks. For
example, the present invention is applicable to electronic
commerce or information retrieval on electronic libraries
and electronic catalogues.

[0042] The principle of the semi-structured document
integrated retrieval scheme of the present invention will
be explained with reference to Figs. 3 and 4.

[0043] Fig. 3 shows an image of the operation sequence
for a user according to the present invention. In Fig. 3, a
user enters a search request for, for example, a PC of
100,000 yen or below into an apparatus that realizes the
integrated retrieval scheme of the present invention.
The apparatus flexibly retrieves required information
involved in HTML documents and provides the user with
a collective search result. The search request may be
made not only with conventional keywords but also with a
simple syntactical query statement consisting of search items
and search conditions. Namely, the present invention is
capable of handling a conditional search such as a search
for a PC of "100,000 yen or below."
[0044] Unlike structural data structured item by item
such as RDB data, the HTML documents are so-called
semi-structured data in which data is structured to a
certain degree by using tags, even though HTML documents
are basically plain text. For example, a data group
related to one subject, such as a table, list, or clause
in an HTML document, may be spread over several
HTML documents, or several data groups may be
contained in a single HTML document. It is hard to
conditionally retrieve item data corresponding to a given
item from these data groups. Search engines have
HTML-described input forms that may have fixed search
entries or search entries that must be filled in to indicate
a search condition. The apparatus of the present
invention is capable of flexibly coping with a user's
search request and providing the user with a collective
search result.

[0045] Fig. 4 shows the principle of the apparatus of
the present invention. The apparatus 1 has an HTML
document storing unit 15 for storing meta data about
HTML documents. The meta data includes the locations,
document structures, presentation styles, etc., of
the HTML documents for each HTML document. The
locations of the HTML documents are, for example,
URLs. The document structure data of the HTML documents
specifies the structures of partial structures such
as tables, lists, and clauses contained in the HTML documents
and is used to map element data in the tables
and lists to items to be extracted. More particularly, the
document structure of a given HTML document indicates
that data pieces corresponding to the items to be
extracted contained in the HTML document are separated
from one another with delimiters such as tags and
slashes. Each field between delimiters such as tags and
slashes in the HTML documents is related to an item and
is managed in table format, etc., by the storing unit 15.
Data pieces contained in the HTML documents frequently
employ different presentation styles even if they
have the same meaning. The presentation styles stored
in the storing unit 15 indicate each presentation
style employed by the HTML documents.
[0046] A user of the apparatus 1 enters a search 
request into a query processing unit 13. The query 
processing unit 13 refers to the meta data stored in the 
HTML document storing unit 15 and specifies the loca- 
tions, document structures, and presentation styles of 
HTML documents related to the search request. The 
query processing unit 13 acquires the HTML docu-
ments, extracts information from the HTML documents
with the use of the specified meta data, and condition- 
ally processes the extracted information if necessary. 
Therefore, the apparatus 1 provides the user with a col- 
lective search result involved in HTML documents in 
presentation styles that are optimum for the user. 
Namely, with a single search request the user is able to 
collectively receive required information from the HTML 
documents scattering over networks through the appa- 
ratus 1 of the present invention. This improves search 
efficiency and reduces traffic congestion in the net- 
works. 

[0047] In this way, first, the apparatus of the present 
invention manages the structure information of semi- 
structured documents such as HTML documents con- 
nected to open networks and retrieves requested infor- 
mation item by item from plural HTML documents. 
Second, the apparatus of the present invention is capa- 
ble of retrieving necessary information from Web infor- 
mation documents through search engines without 
bothering the user with differences among the search 
methods of various Web sources. 

First embodiment 

[0048] An HTML document information integrated 
retrieval apparatus of the first embodiment according to 
the present invention concerning semi-structured docu- 
ment information retrieval scheme will be explained with 
reference to Figs. 5 to 16. 

[0049] HTML documents are scattering over open net- 
works and have individual document structures, presen- 
tation styles, and partial structures such as tables 
containing different elements. The first embodiment 
retrieves required information involved in various HTML 
documents and provides a user with a collective search 
result in presentation styles that are optimum for the 
user. 

[0050] A concept regarding the presentation styles
and terms used for the embodiments will be explained
first. HTML documents employ different presentation
styles to express even the same meaning. For example,
the price of a product is expressed like "¥1,000," "one
thousand yen," or "1,000 yen" depending on the writers
of HTML documents. Terms employed by this specification
will be explained.

[0051] A domain is equal to one presentation style.
For example, "1,000 yen" for a price is a with-yen presentation
style that forms a domain, and "¥1,000" is a
with-¥ presentation style that forms a domain.

[0052] A domain group is a collection of domains 
related to the same meaning. For example, prices form 
a domain group, and dates (year, month, day) form a 
domain group. 

[0053] A user input domain is a domain related to a
user's search request input. For example, the with-yen
presentation style for a price forms a user input domain,
and the Christian era for a date with "/" as a delimiter
forms a user input domain.

[0054] A user output domain is a domain related to a
search result for a user. For example, the with-¥ presentation
style for a price forms a user output domain, and
an abbreviated date for a date with V as a delimiter
forms a user output domain.

[0055] A user domain covers user input and output
domains.

[0056] A local domain is a domain in a given HTML 
document. For example, the with-yen presentation style 
for a price forms a local domain. 

[0057] A domain conversion function is a function for
converting a user input domain into a local domain, or a
local domain into a user output domain.
[0058] If different user input domains, user output
domains, and local domains are involved, the difference
will be resolved by the domain conversion functions.
Fig. 5 is a block diagram showing a configuration of the
HTML document information integrated retrieval apparatus
according to the first embodiment.
[0059] In Fig. 5, the apparatus 1 of the first embodiment
has a user interface unit 11, a syntax analysis unit
12, a query processing unit 13, an HTML document
access unit 14, an HTML document meta data storing
unit 15, and an HTML document meta data managing
unit 16. The query processing unit 13 has a query item
finding unit 131, a query conversion unit 132, a conversion
function library 133, an HTML document processing
unit 134, and a retrieval result conversion unit 135.
[0060] The user interface unit 11 receives a search
request (query statement) consisting of search items
and search conditions entered by a user through an
application program 3. The syntax analysis unit 12 analyzes
the syntax of the query statement received by the
user interface unit 11. The query processing unit 13
collectively retrieves required information items involved in
HTML documents. More precisely, the query item finding
unit 131 finds the locations of the items specified in the
query statement. The query conversion unit 132 converts
each user input domain in the query statement
into a corresponding local domain and forms queries to
be transmitted from the HTML document access unit
14. The HTML document access unit 14 receives HTML
documents that are returned in response to the queries.
The HTML document processing unit 134 acquires
information from the received HTML documents and
processes the information according to the query statement.
For example, the HTML document processing
unit 134 selects information pieces corresponding to the
search items, filters the selected information pieces
according to the search conditions, and provides a
search result. The retrieval result conversion unit 135
converts local domains in the retrieval result into user
output domains. The HTML document access unit 14
collects HTML documents scattering over open networks
and converts information contained in the HTML
documents into information of a unified form such as a
table. The HTML document access unit 14 is connected
to HTML document servers 2-1, 2-2, and the like. Each
of the HTML document servers has HTML documents
21 and a Web server 22 that manages the HTML documents
21. The HTML document meta data storing unit
15 stores meta data about the HTML documents. The
meta data includes the document structure, presentation
styles, items, etc., of each HTML document to be
retrieved. Item information in a partial structure such
as a table in a given HTML document frequently does not
correspond one-to-one with the items stipulated in a
search request. In this case, the meta data relates the
plural elements of the partial structure to the
corresponding item in the search request. Note
that an element hereinafter means an information piece
contained in an HTML document. The HTML document meta data
manager 16 stores new meta data in the storing unit 15
and deletes and changes the meta data in the storing
unit 15. The HTML document meta data manager 16 is
implemented in, for example, an editor and is controlled
by a system manager.

[0061] Fig. 6 shows the table structure of the HTML
document meta data storing unit 15. The HTML document
meta data storing unit 15 stores meta data in the form of
tables. An HTML document table 151 stores the locations
of HTML documents. An HTML document to table
mapping table 152 stores data used to convert elements
contained in the HTML documents into items
forming a table. An HTML document item table 153
stores the attributes of items contained in the HTML
documents for each item. A domain table 154 stores the
presentation styles of domains. A user domain table
155 stores the input and output domains of each user. A
domain conversion function table 156 stores domain
conversion functions.

[0062] Processing steps carried out by the apparatus
1 of the first embodiment will be explained. The
processing steps are carried out in two phases, i.e., a
preparatory phase of Fig. 7 and an execution phase of
Fig. 8. In the preparatory phase, a managing person
prepares meta data about HTML documents through
the HTML document meta data manager 16 before
starting the execution phase.
[0063] In the preparatory phase of Fig. 7, step S100
stores the locations of HTML documents in the HTML
document table 151. Step S110 sets, in the HTML document
to table mapping table 152, data used to convert
elements contained in the HTML documents into a table
consisting of items. Step S120 sets, in the item table
153, the attributes of items contained in the HTML documents.
Step S130 sets, in the domain table 154, local
domains of the items contained in the HTML documents.
Step S140 sets, in the user domain table 155,
the input and output domains of each user. Step S145
checks to see if there are sufficient conversion functions
for converting a given domain into another. If not, step
S150 prepares necessary domain conversion functions
and stores them in the domain conversion function table
156.

[0064] The execution phase of Fig. 8 will be explained.
In step S200, the syntax analysis unit 12 analyzes the
syntax of a query statement entered by a user, and the
query item finding unit 131 finds the locations of search
items specified by the user in the HTML document table
151. In step S210, the query item finding unit 131 finds
HTML documents that have all of the search items in
the HTML document item table 153. In step S220, the
query conversion unit 132 gets user input domains, user
output domains, and local domains corresponding to the
found items from the tables 154 and 155. In step S225,
the query conversion unit 132 checks to see if the user
input domains and local domains of the search items
agree with each other. If they do not agree for an item,
then in step S230 the query conversion unit 132 gets a
domain conversion function for the item from the domain
conversion function table 156 and converts the user
input domain of the item into the corresponding local
domain. In step S240, the HTML document processing
unit 134 gets HTML documents through the HTML
document access unit 14, extracts items for the search
items from the HTML documents, and prepares a
search result. In step S245, the HTML document
processing unit 134 checks to see if the user output
domain and local domain of each item agree with each
other. If they do not agree for an item, then in step S250
the HTML document processing unit 134 gets a domain
conversion function for the item from the domain conversion
function table 156 and converts the local domain of the item
into the corresponding user output domain. In step S260,
the search result having
proper user output domains is supplied to the user
through the user interface unit 11.
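
The filtering part of the execution phase (steps S240 to S250) can be illustrated with a small sketch. This is not the apparatus itself. It assumes, for simplicity, that every price presentation style can be normalized to an integer by stripping non-digit characters, which stands in for the domain conversion functions of table 156. The function names `to_number` and `run_query` and the sample rows are hypothetical.

```python
import re

def to_number(text):
    """Normalize a price such as '¥198,000' or '198,000 yen' to an integer.

    Assumption: dropping every non-digit character is enough; the patent
    instead looks up a domain conversion function per item.
    """
    return int(re.sub(r"[^\d]", "", text))

def run_query(rows, limit_text):
    """Keep rows whose price is below the limit and emit with-yen output."""
    limit = to_number(limit_text)             # user input domain -> number
    result = []
    for shop, maker, product, price_text in rows:
        price = to_number(price_text)         # local domain -> number
        if price < limit:                     # numeric search condition
            # number -> user output domain (with-yen presentation style)
            result.append((shop, maker, product, f"{price:,} yen"))
    return result

# Hypothetical rows as they might look after extraction in step S240.
rows = [
    ("Shop A", "MakerX", "PC-100", "¥198,000"),
    ("Shop B", "MakerY", "Note-7", "248,000"),
]
filtered = run_query(rows, "200,000 yen")
```

Comparing on a numeric value rather than on the displayed string is what lets a single condition apply to pages that present prices differently.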
[0065] The details of the process procedure of the first
embodiment will be explained with reference to Figs. 9
to 16.

[0066] Figure 9A shows an exemplary display on a
Web browser of an HTML document concerning
product information of a shop A, and Fig. 10A shows
that of a shop B. Figure 9B shows an HTML description
that provides the display of Fig. 9A, and Fig. 10B shows
an HTML description that provides the display of Fig.
10A.

[0067] The shop A employs a tag TABLE to form a 
table to show their product information. The shop B 
employs a tag OL to form a clause of their product infor- 
mation. 

[0068] The shop A displays each price with the with-¥ 
presentation style, and the shop B shows each price 
with the with-yen presentation style. 
[0069] The shop A has a product name as an element, 
and the shop B has a maker name and a product name 
as elements. 

[0070] The location of the product information of the
shop A is a URL of http://www.shop-a.co.jp/products.html,
and that of the shop B is a URL of
http://www.shop-b.co.jp/shouhin.html.
[0071] In this way, the HTML documents of Figs. 9A
and 10A have different document structures, presentation
styles, and elements.

(1) Preparatory phase 

[0072] Step S100 of Fig. 7 sets the locations of the 
HTML documents in the document table 151. In this 
example, the locations are page names and URLs as 
shown in Fig. 11. 

(a) Shop A 

Page name: Shop-A 

URL: http://www.shop-a.co.jp/products.html 

(b) Shop B 

Page name: Shop-B 

URL: http://www.shop-b.co.jp/shouhin.html 

[0073] Step S110 sets data for converting elements 
contained in the HTML documents into a table in the 
HTML document to table mapping table 152. In this 
example, page names, record start points, and ways of 
extracting columns 1 to 4 are set as shown in Fig. 12. 
For the prices of the shop B, only numerals and the 
positions including y are picked up. 

(a) Shop A

Page name: Shop-A
Record start: line starting with <TR><TD>
Column 1: "Shop A" fixed
Column 2: between 1st <TD> and 1st "/" in record start line
Column 3: between 1st "/" and 1st </TD> in record start line
Column 4: between 2nd <TD> and 2nd </TD> in record start line

(b) Shop B

Page name: Shop-B
Record start: line starting with <LI>
Column 1: "Shop B" fixed
Column 2: between 1st <LI> and 1st "/" in record start line
Column 3: between 1st "/" and 2nd "/" in record start line
Column 4: between 2nd "/" and 1st "yen" in record start line
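
As an illustration of how such mapping rules might be applied, the following minimal sketch extracts columns from record lines by counting delimiter occurrences. It is an assumption-laden example rather than the patented implementation: the rule encoding, the helper names `nth_index`, `between`, and `extract_records`, and the sample HTML lines are all hypothetical, and the in-line delimiter is assumed to be a slash.

```python
def nth_index(text, token, n):
    """Index of the n-th (1-based) occurrence of token in text, or -1."""
    pos = -1
    for _ in range(n):
        pos = text.find(token, pos + 1)
        if pos == -1:
            return -1
    return pos

def between(line, left, left_n, right, right_n):
    """Text between the left_n-th `left` and the right_n-th `right`."""
    start = nth_index(line, left, left_n)
    end = nth_index(line, right, right_n)
    if start == -1 or end == -1:
        return None
    start += len(left)
    return line[start:end].strip() if end >= start else None

# Hypothetical encoding of the mapping table entries for the two pages.
RULES = {
    "Shop-A": {"record_start": "<TR><TD>", "columns": [
        ("fixed", "Shop A"),
        ("between", "<TD>", 1, "/", 1),
        ("between", "/", 1, "</TD>", 1),
        ("between", "<TD>", 2, "</TD>", 2)]},
    "Shop-B": {"record_start": "<LI>", "columns": [
        ("fixed", "Shop B"),
        ("between", "<LI>", 1, "/", 1),
        ("between", "/", 1, "/", 2),
        ("between", "/", 2, "yen", 1)]},
}

def extract_records(page, html):
    """Return one list of column values per record line of the page."""
    rule = RULES[page]
    records = []
    for line in html.splitlines():
        line = line.strip()
        if not line.startswith(rule["record_start"]):
            continue  # not a record start line
        row = [spec[1] if spec[0] == "fixed" else between(line, *spec[1:])
               for spec in rule["columns"]]
        records.append(row)
    return records

# Invented sample lines, not the actual content of Figs. 9B and 10B.
html_a = "<TABLE>\n<TR><TD>MakerX/PC-100</TD><TD>¥198,000</TD></TR>\n</TABLE>"
html_b = "<OL>\n<LI>MakerY/Note-7/148,000 yen\n</OL>"
rows_a = extract_records("Shop-A", html_a)
rows_b = extract_records("Shop-B", html_b)
```

Both pages yield rows with the same four columns, which is the unified table form that later steps rely on.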

[0074] Step S120 stores the attributes of the items
involved in the HTML documents in the HTML document
item table 153. In this example, the page names,
corresponding columns, column titles, and data types
are stored as shown in Fig. 13. Only price information is
defined as a numeric value in data type. Values of this
data type are used for comparison when processing the
search conditions.

(a-1) Page Shop-A, column 1

Page name: Shop-A
Column: column 1
Column title: shop name
Data type: character string

(a-2) Page Shop-A, column 2

Page name: Shop-A
Column: column 2
Column title: maker name
Data type: character string

(a-3) Page Shop-A, column 3

Page name: Shop-A
Column: column 3
Column title: product name
Data type: character string

(a-4) Page Shop-A, column 4

Page name: Shop-A
Column: column 4
Column title: price
Data type: numeric value

(b-1) Page Shop-B, column 1

Page name: Shop-B
Column: column 1
Column title: shop name
Data type: character string

(b-2) Page Shop-B, column 2

Page name: Shop-B
Column: column 2
Column title: maker name
Data type: character string

(b-3) Page Shop-B, column 3

Page name: Shop-B
Column: column 3
Column title: product name
Data type: character string

(b-4) Page Shop-B, column 4

Page name: Shop-B
Column: column 4
Column title: price
Data type: numeric value

[0075] Step S130 sets local domain names for the elements
contained in the HTML documents in the domain
table 154 as shown in Fig. 14. No local domains are set
for the shop names, maker names, and product names
of the shops A and B because they are represented with
optional character strings. On the other hand, local
domains for the product prices of the shops A and B are
set as follows according to the value set in the HTML
document item table 153. The local domain is registered
in the HTML document item table 153.

Domain group: price
Local domain of Shop-A: with-¥ presentation style
Local domain of Shop-B: value-comma presentation style

[0076] Step S140 sets user input and output domains
for each user in the user domain table 155 as shown in
Fig. 15. A user A enters a shop name, maker name, and
product name in HTML presentation styles and
requests a search output in the same presentation
styles, and therefore, no user input and output domains
for these items are set. For a price domain group,
assume that the user A requests as follows:

Input: with-yen presentation style
Output: with-yen presentation style

[0077] This domain is registered in the domain table
154, and the user domain is registered in the user
domain table 155. The user domain may contain different
user input and output domains.
[0078] Step S150 sets domain conversion functions in
the domain conversion function table 156 as shown in
Fig. 16. In this example, there are three domains including
the value-comma presentation style, with-yen presentation
style, and with-¥ presentation style.
Accordingly, mutual conversion functions between the
user input domains and the local domains and between
the user output domains and the local domains are set
as follows and are stored in the domain conversion
function table 156. These conversion functions are also
stored in the conversion function library 133.

(a) Conversion from value-comma presentation
style into with-yen presentation style

Conversion function name: Num2Yen()
Conversion input domain: value-comma presentation style
Conversion output domain: with-yen presentation style

(b) Conversion from with-yen presentation style into
value-comma presentation style

Conversion function name: Yen2Num()
Conversion input domain: with-yen presentation style
Conversion output domain: value-comma presentation style

(c) Conversion from value-comma presentation
style into with-¥ presentation style

Conversion function name: Num2¥()
Conversion input domain: value-comma presentation style
Conversion output domain: with-¥ presentation style

(d) Conversion from with-¥ presentation style into
value-comma presentation style

Conversion function name: ¥2Num()
Conversion input domain: with-¥ presentation style
Conversion output domain: value-comma presentation style

(e) Conversion from with-yen presentation style into
with-¥ presentation style

Conversion function name: Yen2¥()
Conversion input domain: with-yen presentation style
Conversion output domain: with-¥ presentation style

(f) Conversion from with-¥ presentation style into
with-yen presentation style

Conversion function name: ¥2Yen()
Conversion input domain: with-¥ presentation style
Conversion output domain: with-yen presentation style
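
A domain conversion function table of this kind could be modeled as a lookup keyed by the pair of conversion input and output domains. The sketch below is only one possible reading: the Python function names are ASCII stand-ins for the functions listed above, the exact string formats (a trailing " yen", a leading "¥") are assumed from the examples in this specification, and the `convert` helper is hypothetical.

```python
def num_to_yen(value):       # stands in for Num2Yen()
    return f"{value} yen"

def yen_to_num(value):       # stands in for Yen2Num()
    return value.replace("yen", "").strip()

def num_to_yensign(value):   # stands in for Num2¥()
    return "¥" + value

def yensign_to_num(value):   # stands in for ¥2Num()
    return value.lstrip("¥").strip()

def yen_to_yensign(value):   # stands in for Yen2¥()
    return num_to_yensign(yen_to_num(value))

def yensign_to_yen(value):   # stands in for ¥2Yen()
    return num_to_yen(yensign_to_num(value))

# (conversion input domain, conversion output domain) -> function
CONVERSION_TABLE = {
    ("value-comma", "with-yen"): num_to_yen,
    ("with-yen", "value-comma"): yen_to_num,
    ("value-comma", "with-¥"): num_to_yensign,
    ("with-¥", "value-comma"): yensign_to_num,
    ("with-yen", "with-¥"): yen_to_yensign,
    ("with-¥", "with-yen"): yensign_to_yen,
}

def convert(value, source_domain, target_domain):
    """Convert a value between domains; identical domains need no function."""
    if source_domain == target_domain:
        return value
    return CONVERSION_TABLE[(source_domain, target_domain)](value)

# User input domain (with-yen) to the local domain of Shop-A (with-¥).
limit_for_shop_a = convert("200,000 yen", "with-yen", "with-¥")
```

Keying the table on both domains mirrors the check in step S145: a missing key signals that a conversion function still has to be prepared in step S150.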

(2) Execution phase 

[0079] The user A issues a search request consisting 
of, for example, a query statement containing search 
item and search condition: 

Search items: shop name, maker name, product 
name, and price 

Search conditions: price < 200,000 yen

[0080] The syntax analysis unit 12 analyzes the query
statement entered by the user. In step S200 of Fig. 8, 
the query item finding unit 131 finds the search items. 
The search items are the shop name, maker name, 
product name, and price. The query item finding unit 
131 finds the column titles corresponding to the search 
items in the HTML document item table 153 and pro- 
vides the following records: 

(a) Shop name

Page Shop-A, column 1, data type of character string
Page Shop-B, column 1, data type of character string

(b) Maker name

Page Shop-A, column 2, data type of character string
Page Shop-B, column 2, data type of character string

(c) Product name

Page Shop-A, column 3, data type of character string
Page Shop-B, column 3, data type of character string

(d) Price

Page Shop-A, column 4, data type of numeric value
Page Shop-B, column 4, data type of numeric value

[0081] In step S210, the query item finding unit 131 
finds the names of HTML documents that contain all of 
the search items and provides the following two combi- 
nations. The URLs of the combinations are obtained 
from the HTML document table 151. 

(A) Combination 1

(a) Page name: Shop-A

(b) Elements

Shop name: column 1, character string
Maker name: column 2, character string
Product name: column 3, character string
Price: column 4, numeric value

(c) URL

http://www.shop-a.co.jp/products.html

(B) Combination 2

(a) Page name: Shop-B

(b) Elements

Shop name: column 1, character string
Maker name: column 2, character string
Product name: column 3, character string
Price: column 4, numeric value

(c) URL

http://www.shop-b.co.jp/shouhin.html
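
Steps S200 and S210 amount to a lookup of the search items in the HTML document item table followed by a check that a page covers every item. A minimal Python sketch is given below, assuming a list-of-tuples layout for the item table and a dictionary for the HTML document table; the page "Shop-C" is a hypothetical entry added only to show a page being rejected.

```python
from collections import defaultdict

# (page name, column, item name, data type), mirroring the records above.
# "Shop-C" is hypothetical and lacks a price column.
ITEM_TABLE = [
    ("Shop-A", 1, "shop name", "character string"),
    ("Shop-A", 2, "maker name", "character string"),
    ("Shop-A", 3, "product name", "character string"),
    ("Shop-A", 4, "price", "numeric value"),
    ("Shop-B", 1, "shop name", "character string"),
    ("Shop-B", 2, "maker name", "character string"),
    ("Shop-B", 3, "product name", "character string"),
    ("Shop-B", 4, "price", "numeric value"),
    ("Shop-C", 1, "shop name", "character string"),
    ("Shop-C", 2, "product name", "character string"),
]

# Page name -> URL, standing in for the HTML document table 151.
DOCUMENT_TABLE = {
    "Shop-A": "http://www.shop-a.co.jp/products.html",
    "Shop-B": "http://www.shop-b.co.jp/shouhin.html",
    "Shop-C": "http://www.example.com/shop-c.html",  # hypothetical
}


def pages_with_all_items(search_items):
    """Return the pages that contain every search item, with columns and URL."""
    columns_by_page = defaultdict(dict)
    for page, column, item, data_type in ITEM_TABLE:
        columns_by_page[page][item] = (column, data_type)

    combinations = {}
    for page, columns in columns_by_page.items():
        if set(search_items).issubset(columns):
            combinations[page] = {
                "elements": {item: columns[item] for item in search_items},
                "url": DOCUMENT_TABLE[page],
            }
    return combinations
```

Called with the four search items of the example, the function returns entries for Shop-A and Shop-B only, which corresponds to the two combinations listed above.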

[0082] In step S220, the query conversion unit 132 acquires user domains and local domains corresponding to the search items. The local domains are obtained from the HTML document item table 153. For any item having a local domain, a domain group is found in the domain table 154, and user domains of the same domain group are retrieved from the user domain table 155. As a result, the following combinations are obtained:

(A) Combination 1

(a) Page name: Shop-A

(b) Elements

Shop name: no local domain
Maker name: no local domain
Product name: no local domain
Price: local domain of with-¥ presentation style
user input domain of with-yen presentation style
user output domain of with-yen presentation style

(B) Combination 2

(a) Page name: Shop-B

(b) Elements

Shop name: no local domain
Maker name: no local domain
Product name: no local domain






Price: local domain of value-comma presentation style
user input domain of with-yen presentation style
user output domain of with-yen presentation style

[0083] For any item having different user input and local domains, the query conversion unit 132 gets a domain conversion function having corresponding conversion input and output domains and converts the user input domain into a local domain in step S230. In each of the above-mentioned combinations, the user input domain differs from the local domain in the price presentation style. Accordingly, proper domain conversion functions are fetched from the domain conversion function table 156 with the conversion input and output domain names serving as keys.

(A) Combination 1

Conversion input domain: with-yen presentation style

Conversion output domain: with-¥ presentation style

Conversion function name: Yen2¥()

(B) Combination 2

Conversion input domain: with-yen presentation style

Conversion output domain: value-comma presentation style

Conversion function name: Yen2Num()

[0084] The conversion functions are executed for the combinations 1 and 2 to obtain the following:

(A) Combination 1

Yen2¥(200,000 yen) = ¥200,000

(B) Combination 2

Yen2Num(200,000 yen) = 200,000

[0085] The query conversion unit 132 generates the following queries for the HTML document access unit 14:

(A) Combination 1

(a) Page name: Shop-A

(b) Search request

Search items: shop name, maker name, product name, and price

Search conditions: price < ¥200,000

(B) Combination 2

(a) Page name: Shop-B

(b) Search request

Search items: shop name, maker name, product name, and price

Search conditions: price < 200,000

[0086] With these queries, the HTML document access unit 14 acquires the HTML documents and generates a search result in step S240. The HTML document processing unit 134 extracts information from the HTML documents located at the obtained URLs and linked URLs according to the HTML document to table mapping table 152, filters the information if there are search conditions, and provides the following search result:

(A) Combination 1

(a) Page: Shop-A

(b) Search result

Shop name: Shop A, maker name: Maker A, product name: PC1, price: ¥170,000
Shop name: Shop A, maker name: Maker B, product name: PC101, price: ¥198,000

(B) Combination 2

(a) Page: Shop-B

(b) Search result

Shop name: Shop B, maker name: Maker A, product name: PC1, price: 168,000

[0087] If there is any item having different user output domain and local domain, the retrieval result conversion unit 135 acquires a corresponding domain conversion function and converts the local domain into a proper user output domain in step S250. In each of the above-mentioned combinations, the local domain and user output domain of the price differ from each other, and therefore, the retrieval result conversion unit 135 searches the domain conversion function table 156 for a proper conversion function according to conversion input and output domains stored in the domain conversion function table 156.

(A) Combination 1

Conversion input domain: with-¥ presentation style

Conversion output domain: with-yen presentation style

Conversion function name: ¥2Yen()

(B) Combination 2

Conversion input domain: value-comma presentation style

Conversion output domain: with-yen presentation style

Conversion function name: Num2Yen()

[0088] The conversion functions are executed to 
obtain the following: 

(A) Combination 1

¥2Yen(¥170,000) = 170,000 yen
¥2Yen(¥198,000) = 198,000 yen

(B) Combination 2

Num2Yen(168,000) = 168,000 yen

[0089] Finally, the user interface unit 11 provides the user with the following search result in step S260:

Shop name: Shop A, maker name: Maker A, product name: PC1, price: 170,000 yen
Shop name: Shop A, maker name: Maker B, product name: PC101, price: 198,000 yen
Shop name: Shop B, maker name: Maker A, product name: PC1, price: 168,000 yen
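
The execution phase of the first embodiment can be summarized as: bring the user's condition and each page's prices into a comparable form, filter, and present the merged result in the user output domain. The Python sketch below is a simplification under stated assumptions: it normalizes every price to an integer instead of calling a separate conversion function per domain pair, the extracted rows are hard-coded rather than fetched over a network, and the row for "PC500" is a hypothetical entry added only so that the filter has something to reject.

```python
def parse_amount(text: str) -> int:
    """Read '¥170,000', '168,000' or '200,000 yen' as an integer amount."""
    cleaned = text.replace("¥", "").replace("yen", "").replace(",", "").strip()
    return int(cleaned)


def to_with_yen(amount: int) -> str:
    """Format an amount in the with-yen presentation style."""
    return f"{amount:,} yen"


# Rows as they might be extracted from each page, with the price still in
# the page's local domain (with-¥ for Shop-A, value-comma for Shop-B).
EXTRACTED_ROWS = {
    "Shop-A": [
        ("Shop A", "Maker A", "PC1", "¥170,000"),
        ("Shop A", "Maker B", "PC101", "¥198,000"),
    ],
    "Shop-B": [
        ("Shop B", "Maker A", "PC1", "168,000"),
        ("Shop B", "Maker C", "PC500", "250,000"),  # hypothetical row
    ],
}


def search(price_limit: str):
    """Apply 'price < price_limit' to every page and merge the results."""
    limit = parse_amount(price_limit)
    merged = []
    for rows in EXTRACTED_ROWS.values():
        for shop, maker, product, price in rows:
            amount = parse_amount(price)
            if amount < limit:
                merged.append((shop, maker, product, to_with_yen(amount)))
    return merged
```

Calling `search("200,000 yen")` yields the three rows shown in paragraph [0089], each with its price in the with-yen presentation style.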

[0090] As explained above, the first embodiment manages meta data about information contained in HTML documents scattered over open networks, to realize collective search on the information contained in the plural HTML documents and generate a search result without regard to differences among the HTML documents. The first embodiment manages information document by document. If an HTML document to be searched is added, corrected, or deleted, the first embodiment simply adds, corrects, or deletes the meta data for that HTML document alone. The first embodiment therefore easily handles an exponentially increasing number of HTML documents as search objects.

[0091] The search result from each HTML document is obtained as item data that is conditionally processed item by item. Therefore, the HTML document processing unit 134 may merge plural search results from plural HTML documents so as to prepare one search result, and filter this search result as a whole if necessary.

[0092] HTML documents scattered over open networks have different document structures, elements, presentation styles, etc. Even with these variations, the first embodiment is capable of retrieving required information from the different HTML documents, converting the retrieved information into a unified form for each user, and returning a collective search result to the user. Compared with the prior art, the first embodiment eliminates the time and labor of manual work and drastically improves search efficiency. The first embodiment is applicable to electronic commerce in flexibly retrieving product information with search conditions of, for example, the names and prices of shops that offer lowest prices for a given product. Consequently, the first embodiment contributes to vitalizing fair electronic commerce.

Second embodiment

[0093] An Internet information integrated retrieval apparatus according to the second embodiment of the semi-structured document information retrieval scheme of the present invention will be explained with reference to Figs. 17 to 38.

[0094] Open networks including the Internet involve search engines having specific input forms. The second embodiment retrieves necessary information with search conditions from the open networks through plural search engines irrespective of differences in the document structures, essential input items, and presentation styles of the search engines and collectively acquires a search result from the search engines.

[0095] The second embodiment employs the same concept and terms as the first embodiment. As explained above, HTML documents employ various presentation styles depending on their writers and users. For example, some HTML documents express Kanagawa prefecture, an area in Japan, as "Kanagawa-ken" and others simply as "Kanagawa."
[0096] "Kanagawa-ken" is a domain of a with-ken presentation style when expressing an area. "Chinese food" is a domain of a with-food presentation style when expressing a genre. The area and the genre each form a domain group. If a user enters a query statement with "Kanagawa-ken" and "Chinese food," this query statement involves user input domains of the with-ken presentation style for area and the with-food presentation style for genre. If a search output for a user has "Kanagawa-ken" and "Chinese food," this search output includes user output domains of the with-ken presentation style for area and the with-food presentation style for genre. If a search result extracted from an HTML document includes "Kanagawa-ken," this search result involves a local domain of the with-ken presentation style for area.
[0097] If a given domain group involves different user input domain, user output domain, and local domain, the second embodiment resolves the difference by using domain conversion functions like the first embodiment.

[0098] Figure 17 shows the Internet information integrated retrieval apparatus 10 according to the second embodiment. This second embodiment is a modification of the first embodiment that replaces the query processing unit 13 of Fig. 15 with an integrated retrieval unit 130. The integrated retrieval unit 130 additionally has an essential item finding unit 136, a retrieval pattern judging unit 137, and a retrieval result processing unit 138. The apparatus 10 has a user interface unit 11, a syntax analysis unit 12, the integrated retrieval unit 130, an HTML document meta data storing unit 150, an HTML document meta data manager 160, and an HTML document access unit 14. The integrated retrieval unit 130 according to the second embodiment has a query item finding unit 131, a query conversion unit 132, a conversion function library 133, the essential item finding unit 136, the retrieval pattern judging unit 137, the retrieval result processing unit 138, and a retrieval result conversion unit 135.

[0099] The same parts as those of the first embodiment shown in Fig. 5 are represented with like reference marks unless specifically mentioned, and their explanations are not repeated. The user interface unit 11 receives a query statement entered by a user through a user application program 3. The query statement consists of search items and search conditions. The syntax analysis unit 12 analyzes the syntax of the query statement received by the user interface unit 11. The integrated retrieval unit 130 collectively retrieves required information involved in HTML documents that are managed by search engines for the search items. More precisely, the query item finding unit 131 finds the location of the search items in HTML documents indicated in the query statement. The essential item finding unit 136 checks essential items in the input forms of search engines and determines search engines to use. The retrieval pattern judging unit 137 determines an optimum search pattern for the query statement and optimizes the search statement for the search engines accordingly. The query conversion unit 132 converts user input domains in the query statement into local domains and prepares queries to be transmitted by the HTML document access unit 14 to the search engines for retrieval. The retrieval result processing unit 138 processes information contained in the acquired HTML documents according to the query statement (e.g., selecting items for search items and filtering data for search conditions). The retrieval result processing unit 138 filters the information extracted from the HTML documents and suppresses conditional processes carried out by the search engines. The retrieval result conversion unit 135 converts local domains with respect to the presentation style of retrieved items in the output of the retrieval result processing unit 138 into user output domains. The HTML document access unit 14 transmits the prepared queries to the search engines and acquires HTML documents scattered over open networks through the search engines. The second embodiment converts information contained in the acquired HTML documents into a unified form such as a table appropriate for the user. The HTML document access unit 14 is connected to search engines 20-1, 20-2, and the like through a communication network 190. Each of the search engines consists of an engine unit 23 and a database 24. The HTML document meta data storing unit 150 stores information for each search engine such as the locations of the search engines and the document structures, presentation styles, and elements of HTML documents. The HTML document meta data manager 160 adds, deletes, and changes meta data in the HTML document meta data storing unit 150. The HTML document meta data manager 160 is implemented in, for example, an editor, to control the registration and management of the meta data in the HTML document meta data storing unit 150.

[0100] Fig. 18 shows the details of the HTML document meta data storing unit 150. The unit 150 stores meta data in the form of tables like the meta data storing unit 15 of Fig. 6. An HTML document table 151 stores the locations of HTML documents. An HTML document to table mapping table 152 stores data for converting elements contained in each HTML document into a table consisting of items. An HTML document item table 153 stores the attribute of each item contained in each HTML document. A domain table 154 stores the presentation styles of domains. A user domain table 155 stores the input and output domains of each user. A domain conversion function table 156 stores domain conversion functions. An essential item table 157 stores essential input items of the input form of each search engine. The retrieval pattern judging unit 137 has a retrieval pattern matrix table 139 of Fig. 30, which is used to determine a retrieval pattern for a given search engine and to optimize a user query statement for the search engine. The retrieval pattern matrix table 139 of Fig. 30 may be stored in the meta data storing unit 150.
[0101] The details of operation of the apparatus 10 of the second embodiment and the details of the setting of contents for the tables will be explained. The operation is carried out in two phases, i.e., a preparatory phase of Fig. 21, which prepares data such as presentation styles before retrieval, and an execution phase of Fig. 31.
[0102] Figs. 19A, 19B, and 19C show examples of input forms of search engines. Figure 20 shows an HTML description corresponding to the input form of Fig. 19B.

(1) Preparatory phase 

[0103] Fig. 21 shows steps carried out in the preparatory phase. Step S300 sets the HTML document item table 153 as shown in Fig. 22. The HTML document item table 153 manages the following items for each input form of the search engines. A column "Page name" contains the names of input forms of the search engines. A column titled "Column" contains column numbers related to the HTML document to table mapping table 152. A column "Item name" contains items contained in the input forms of the search engines. A column "Availability" contains data to indicate whether or not the data items are obtainable from the retrieval result of the corresponding search engines. A column "Conditional" contains data to indicate whether or not the data items are conditionally processable by the corresponding search engines. A column "Data type" contains data to indicate whether each data item is a numeric value or a character string and is used when evaluating and filtering information. A column "Name tag" contains a NAME-tag if a corresponding data item employs a selection form. A column "Local domain" contains local domains for corresponding column numbers.

[0104] Step S310 sets the HTML document table 151 as shown in Fig. 23. The HTML document table 151 manages the locations of the input forms of the search engines. A column "Page name" contains the names of the input forms of the search engines. A column "Search engine URL" contains URLs serving as location information of the search engines.
[0105] Step S320 sets the HTML document to table mapping table 152 as shown in Fig. 24. The HTML document to table mapping table 152 maps information contained in HTML documents returned by the search engines to a table. A column "Page name" contains the names of the input forms of the search engines. A column "Record start" contains tags that indicate each start line of contents in a corresponding HTML document. Columns titled "Column 1" to "Column 5" each contain tags that indicate a portion corresponding to a data item to be retrieved in each obtained HTML document. The column titles "Column 1" to "Column 5" of Fig. 24 correspond to the columns 1 to 5 listed in the column titled "Column" of the HTML document item table 153 for Page-A shown in Fig. 22. Step S330 sets the domain table 154 as shown in Fig. 25. The domain table 154 manages domain groups and the domains set as local domain information in the HTML document item table 153.
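
The role of the "Record start" and "Column" tags in the HTML document to table mapping table can be illustrated with a short Python sketch. The tags `<TR>` and `<TD>` and the sample HTML are assumptions made for illustration, since the contents of Fig. 24 are not reproduced here, and the sketch uses one tag for all columns although the table allows a separate tag per column.

```python
import re


def html_to_rows(html: str, record_start: str, column_tag: str):
    """Split a result page into records, then pull out the tagged cells.

    record_start and column_tag stand in for the "Record start" and
    "Column n" entries of the mapping table (hypothetical values).
    """
    closing_tag = column_tag.replace("<", "</", 1)
    cell_pattern = re.compile(
        re.escape(column_tag) + r"(.*?)" + re.escape(closing_tag),
        re.IGNORECASE | re.DOTALL,
    )
    rows = []
    # Everything before the first record start tag is page preamble.
    chunks = re.split(re.escape(record_start), html, flags=re.IGNORECASE)[1:]
    for chunk in chunks:
        cells = [cell.strip() for cell in cell_pattern.findall(chunk)]
        if cells:
            rows.append(cells)
    return rows


# Hypothetical page returned by a search engine.
SAMPLE_HTML = (
    "<TABLE>"
    "<TR><TD>Shop A</TD><TD>Yokohama city</TD></TR>"
    "<TR><TD>Shop B</TD><TD>Kawasaki city</TD></TR>"
    "</TABLE>"
)
```

With these assumptions, `html_to_rows(SAMPLE_HTML, "<TR>", "<TD>")` produces one list of cell strings per record, which is the table form on which item selection and filtering can then operate.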

[0106] Step S340 sets the domain conversion function table 156 as shown in Fig. 26. The domain conversion function table 156 manages domain conversion functions. A column "Conversion function name" contains the name of each function for converting a specific domain into another domain. A column "Domain group" contains each group of domains of the same kind. A column "Conversion input domain" contains each input domain for each domain conversion function. A column "Conversion output domain" contains each output domain for each domain conversion function. A column "Library name" contains the name of the file of the conversion function library 133.

[0107] Step S350 sets the user domain table 155 as shown in Fig. 27. The user domain table 155 manages the input and output domains indicated by each user per domain group. A column "User name" contains the name of each user that issues a search request. A column "User input domain" contains user input domains used by the users for each domain group. A column "User output domain" contains user output domains used by the users for each domain group.
[0108] Step S360 sets the essential item table 157 as shown in Fig. 28. The input forms of some search engines have essential items to be filled in. The essential item table 157 manages such essential items. A column "Page name" contains the names of the input forms of the search engines. A column "Essential item" contains essential items that must be filled in.

(2) Execution phase 

[0109] Figure 31 shows steps carried out in the execution phase of the second embodiment.
[0110] For example, a user wants to know the names and telephone numbers of Japanese food restaurants in Kanagawa prefecture. For this, a search request is made with a query statement of simple syntax, such as an SQL statement containing SELECT and WHERE clauses.
[0111] In step S400, the user interface unit 11 receives the query statement. The user who made the query is the user 1 shown in Fig. 27, and the search items are "Shop name" and "Phone number" with search conditions of "area = Yokohama city" and "genre = Japanese food." The query statement is as follows:
[0112] SELECT Shop name, phone number WHERE area = "Yokohama city" and genre = "Japanese food" (1-1)
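
The split of a query statement such as (1-1) into search items and search conditions can be sketched as follows. This is a minimal Python illustration under stated assumptions: it accepts only the SELECT ... WHERE form with equality conditions joined by "and", the function name `parse_query` is hypothetical, and it is not a description of how the syntax analysis unit 12 is actually implemented.

```python
import re


def parse_query(statement: str):
    """Return (search items, {condition item: value}) for a simple statement."""
    match = re.fullmatch(
        r"\s*SELECT\s+(.*?)\s+WHERE\s+(.*?)\s*",
        statement,
        flags=re.IGNORECASE | re.DOTALL,
    )
    if match is None:
        raise ValueError("expected: SELECT <items> WHERE <conditions>")

    # Item names are compared case-insensitively, so normalize to lower case.
    items = [item.strip().lower() for item in match.group(1).split(",")]

    conditions = {}
    # Assumes condition values do not themselves contain the word "and".
    for clause in re.split(r"\s+and\s+", match.group(2), flags=re.IGNORECASE):
        name, value = clause.split("=", 1)
        conditions[name.strip().lower()] = value.strip().strip('"')
    return items, conditions
```

For the statement (1-1), this yields the items "shop name" and "phone number" and the conditions area = "Yokohama city" and genre = "Japanese food", which are the inputs used by the later steps.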

[0113] In step S410, the query item finding unit 131 
refers to the HTML document item table 153 of Fig. 22 
and finds search engines that have the data items cor- 
responding to the search items and conditions. Figure 
32 shows the search engines thus found. 
[0114] In step S420, the query item finding unit 131 
refers to the document table 151 according to the result 
of step S410 and specifies pages that have the items 
"Shop name," "Phone number," "Area," and "Genre." 
Then, the search engines of Page-A, Page-B, and 
Page-C are selected. 

[0115] In step S430, the essential item finding unit 136 refers to the essential item table 157 of Fig. 28, checks the essential items of the search engines, and narrows the search engines to be used. Some search engines have essential items to be filled in. Thus, among the search engines found in step S420, the essential item finding unit 136 excludes any search engine that has an essential item not specified as a search condition. The query statement (1-1) has the conditional items of "Area" and "Genre." In connection with them, the search engine of Page-A has an essential input item "Genre" that agrees with the search condition item "Genre." Accordingly, the search engine of Page-A is adoptable. The search engine of Page-B has an essential input item "Area" that corresponds to the search condition item "Area," and therefore, the search engine of Page-B is also adoptable. The search engine of Page-C has essential input items "Area" and "Genre," and therefore, is adoptable.
[0116] On the other hand, assume that the following query statement is entered:

SELECT shop name, phone number WHERE area = "Yokohama city" (1-2)

[0117] In this case, the query item finding unit 131 refers to the HTML document item table 153 and selects Page-A, Page-B, and Page-C as the found search engines, because all three engines have the items "shop name," "phone number," and "area."
[0118] Next, the essential item finding unit 136 narrows the search engines selected by the query item finding unit 131 as follows.
[0119] Page-A sets genre as an essential item. This means that a designation for the item "genre" is essential for retrieval from Page-A, so that retrieval from Page-A fails unless genre is designated. Genre is not designated in the search condition, i.e., the WHERE clause of the query statement (1-2), and accordingly the essential item finding unit 136 excludes Page-A from the candidates.
[0120] Page-C sets both genre and area as essential items, so that Page-C is also excluded from the candidates.
[0121] On the contrary, Page-B sets area as an essential item, and "area" is designated in the WHERE clause, so that Page-B is selected as a search engine to be retrieved.
[0122] Note that, when the above query statement (1-2) is to be transmitted to a search engine that has no essential item, the search engine can be searched whether or not "area" is designated in the WHERE clause, because the search engine (page) imposes no essential conditional item. Accordingly, the essential item finding unit 136 selects that search engine as a search engine to be retrieved.
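
The narrowing rule of step S430 reduces to a subset test: a search engine is usable only if every one of its essential items appears among the condition items of the query. The Python sketch below encodes the essential items stated above for Page-A, Page-B, and Page-C; the dictionary layout, the function name, and the page "Page-D" with no essential item are assumptions for illustration.

```python
# Page name -> essential input items, as described for Fig. 28.
ESSENTIAL_ITEMS = {
    "Page-A": {"genre"},
    "Page-B": {"area"},
    "Page-C": {"area", "genre"},
    "Page-D": set(),  # hypothetical engine without essential items
}


def usable_engines(candidates, condition_items):
    """Keep the engines whose essential items are all given as conditions."""
    specified = set(condition_items)
    return [
        page
        for page in candidates
        if ESSENTIAL_ITEMS.get(page, set()) <= specified
    ]
```

With the conditions of statement (1-1), "area" and "genre", all three pages pass; with statement (1-2), which specifies only "area", only Page-B remains, and an engine without essential items always passes.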

[0123] Returning to the query statement (1-1), at this time, the following SQL statements according to the query statement (1-1) are prepared for the selected search engines:

Page-A: 

SELECT shop name, phone number WHERE 
area = "Yokohama city" and genre = "Japanese 
food" (2-1) 

Page-B: 

SELECT shop name, phone number WHERE
area = "Yokohama city" and genre = "Japanese
food" (2-2)

Page-C: 

SELECT shop name, phone number WHERE 
area = "Yokohama city" and genre = "Japanese 
food" (2-3) 

[0124] In step S440, the retrieval pattern judging unit 137 refers to the retrieval pattern matrix of Fig. 30 and determines retrieval methods. The retrieval pattern matrix will be explained. Figure 29 shows a simplified relationship between the apparatus of the second embodiment and search engines. There are three retrieval patterns (a), (b), and (c) for processing a search request entered by a user. The pattern (a) returns the search request to the user without processing it. The pattern (b) conditionally processes the search request by the search engines. The pattern (c) processes the search request by the search engines and filters the process result by the apparatus 10 of the second embodiment. The retrieval pattern matrix of Fig. 30 is used to select one of the three patterns for each search item in a given query statement. The retrieval pattern judging unit 137 refers to the retrieval pattern matrix and determines retrieval strategies. In Fig. 30, a column "Item" under a title "Search request" contains each item to retrieve specified by, for example, a SELECT clause in an SQL statement. A column "Condition" under the "Search request" contains each search condition specified by, for example, a WHERE clause in the SQL statement. A column "Item" under a title "Search engine" contains each item returned by a search engine as a retrieval result. A column "Condition" under the "Search engine" contains each condition set in a search request and stipulated in the input form of each search engine. The column "Item" under the "Search engine" corresponds to the column "Availability" in the HTML document item table 153 of Fig. 22, and the column "Condition" under the "Search engine" corresponds to the column "Conditional" in the HTML document item table 153. A column "Return as it is" contains data to indicate whether or not a search condition value is returned as it is without processing a search item. A column "Return from search engine" contains data to indicate whether or not a result provided by a search engine for a given search item is returned as it is. A column "Process by search engine" contains data to indicate whether or not a given search condition is processed by a search engine. A column "Filtering" contains data to indicate whether or not a retrieval result returned from a search engine with respect to a given search condition is processed by the retrieval result processing unit 138 of the apparatus 10.
[0125] For example, the search statement (1-1) stipulates "Shop name" with the SELECT clause but not with the WHERE clause. The item "Shop name" is "o" in "Item" and "x" in "Condition" in "Search request" of Fig. 30. Referring to the HTML document item table 153 of Fig. 22, the input form of the search engine Page-A of Fig. 19A is capable of receiving "Shop name" as a search condition and returning it as a search result. Accordingly, the search engine of Fig. 19A is "o" in each of "Item" and "Condition" in Fig. 30. Namely, "Shop name" of the search engine of Fig. 19A corresponds to the fourth record from the top of Fig. 30. Accordingly, the process pattern of the Page-A for "Shop name" returns information provided by the search engine as an item without conditionally processing the information because a condition is not stipulated in the SQL statement.
[0126] On the other hand, "Area" is specified in the WHERE clause but not in the SELECT clause in the search statement (1-1). Accordingly, "Area" is "x" in "Item" and "o" in "Condition" in "Search request" of Fig. 30. According to the HTML document item table 153 of Fig. 22, the Page-A of Fig. 19A is unable to receive a condition for "Area" but is able to return a search result for "Area." Accordingly, "Area" of the Page-A is "o" in "Item" and "x" in "Condition" in "Search engine" of Fig. 30. As a result, "Area" of the Page-A corresponds to the eighth record from the top of Fig. 30. Namely, the process pattern of the Page-A for "Area" returns no information because it is not stipulated in the SELECT clause of the SQL statement, and the search engine is unable to carry out the conditional process. Instead, the retrieval result processing unit 138 carries out a filtering process to return a retrieval result. Similar processes are carried out for the Page-A on "Phone number" and "Genre" specified in the SQL statement (1-1), to derive a matrix of Fig. 33 from the matrix of Fig. 30.
[0127] Namely, Fig. 33 shows a result of determina- 
tion of items and conditions to be set for the Page-A with 
respect to the search request. It is understood from a 
column "Process by search engine" that the search con- 
dition for "Genre" must be transmitted to the Page-A. It 
is understood from a column "Filtering" that a search 
result for "Area" from the Page-A must be filtered 
according to the condition set for "Area." It is under- 
stood from a column "Return from search engine" that 
"Shop name" and "Phone number" provided by the 
Page-A must be returned as they are to the user. 
[0128] The Page-A accepts search conditions for 
"Shop name" and "Genre," while the query statement 
(1-1) stipulates a search condition only for "Genre." 
Accordingly, "Japanese food" is set for "Genre" when 
sending a query to the Page-A. Thereafter, the retrieval 
result processing unit 138 carries out a filtering process
to select data in the items "Shop name" and "Phone 
number" whose "Area" contains "Yokohama city" and 
prepares a retrieval result. Consequently, the pattern (c) 
is applied to the Page-A, and the query statement (2-1) 
is rewritten as follows: 

Filtering condition: "Area" = "Yokohama city" 
SELECT shop name, phone number WHERE 
genre = "Japanese food" (3-1) 

[0129] Similarly, query statements for the Page-B and Page-C are prepared. Figure 34 shows a result of examination on the Page-B. It is understood from a column "Process by search engine" that the search condition for "Area" is transmitted to the Page-B. It is understood from a column "Filtering" that a search result provided by the Page-B is filtered according to the condition set for "Genre." It is understood from a column "Return from search engine" that information pieces to be provided by the Page-B for "Shop name" and "Phone number" are returned as they are to the user. Consequently, the pattern (c) is applied to the Page-B, and the query statement (2-2) is rewritten as follows:

Filtering condition: "Genre" = "Japanese food"
SELECT shop name, phone number WHERE area = "Yokohama city" (3-2)

[0130] Figure 35 shows a result of examination on the Page-C. It is understood from a column "Process by search engine" that the search conditions for "Area" and "Genre" are transmitted to the Page-C. It is understood from a column "Filtering" that a search result provided by the Page-C is not filtered. It is understood from a column "Return from search engine" that information pieces to be provided by the Page-C for "Shop name" and "Phone number" are returned as they are to the user. Consequently, the pattern (b) is applied to the Page-C, and the query statement (2-3) is rewritten as follows:

Filtering condition: none

SELECT shop name, phone number WHERE area = "Yokohama city" and genre = "Japanese food" (3-3)
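
The rewriting of the statements (2-1) to (2-3) into (3-1) to (3-3) can be viewed as splitting the search conditions into those a search engine can process and those left to the filtering process. The Python sketch below is a simplified rule, not a reproduction of the retrieval pattern matrix of Fig. 30: the sets of conditionally processable items for Page-B and Page-C are assumed from the outcome described above, and the check that a filtered item is actually returned by the engine ("Availability") is omitted.

```python
# Items each search engine can process as a condition ("Conditional" column).
CONDITIONAL_ITEMS = {
    "Page-A": {"shop name", "genre"},  # stated for Page-A in the text
    "Page-B": {"area"},                # assumed from statement (3-2)
    "Page-C": {"area", "genre"},       # assumed from statement (3-3)
}


def plan_query(page: str, conditions: dict):
    """Split conditions into engine-side and filter-side parts.

    Returns (engine conditions, filtering conditions, pattern), where the
    pattern is "c" when filtering by the apparatus is needed and "b"
    when the search engine handles every condition.
    """
    processable = CONDITIONAL_ITEMS[page]
    engine_side = {k: v for k, v in conditions.items() if k in processable}
    filter_side = {k: v for k, v in conditions.items() if k not in processable}
    pattern = "c" if filter_side else "b"
    return engine_side, filter_side, pattern
```

Applied to the conditions of statement (1-1), the sketch sends "genre" to Page-A and filters on "area", sends "area" to Page-B and filters on "genre", and sends both conditions to Page-C with no filtering.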

[0131] In step S450 of Fig. 31, the query conversion unit 132 converts the query statements provided by the retrieval pattern judging unit 137 into queries having local domains appropriate for the search engines. For each item specified in the search conditions whose corresponding search engine item has a local domain set, the query conversion unit 132 acquires the user input domain and the local domain from the tables 153 and 155, as shown in Fig. 36. For each item having different user input domain and local domain, the query conversion unit 132 fetches a proper conversion function from the conversion function library 133 according to the domain conversion function table 156 and converts the user input domain into a corresponding local domain. For example, the item "Area" in the Page-B has a local domain of "Page-B-City." A user input domain for this domain group is a domain "with-city (SHITSUKI)" from the tables 154 and 155. Accordingly, the query conversion unit 132 refers to the domain conversion function table 156, fetches a conversion function "Shi2ValueB()," and converts "Yokohama city" into "07" that indicates the seventh entry in a selection list in the input form of the Page-B.

[0132] The item "Genre" of the Page-C has a local domain of "Page-C-Dishes." According to the tables 154 and 155, the user input domain for this domain group is the domain "with-food (RYOURITSUKI)." As a result, the query conversion unit 132 refers to the domain conversion function table 156, fetches a conversion function "Ryouri2ValueC()," and converts "Japanese food" into "1," which indicates the first entry in a selection list of the input form of the Page-C.
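The lookup of a conversion function by domain pair, as described above, can be sketched as follows. This is only an illustration under stated assumptions: the dictionary `DOMAIN_CONVERSIONS` stands in for the domain conversion function table 156, the functions `city_to_page_b` and `dish_to_page_c` are hypothetical stand-ins for the library functions, and the selection lists contain only the single entries mentioned in the text.

```python
# Hypothetical selection lists; only the entries named in the description.
PAGE_B_CITIES = {"Yokohama city": "07"}
PAGE_C_DISHES = {"Japanese food": "1"}


def city_to_page_b(value):
    """Convert a city name into the Page-B selection-list value."""
    return PAGE_B_CITIES[value]


def dish_to_page_c(value):
    """Convert a dish genre into the Page-C selection-list value."""
    return PAGE_C_DISHES[value]


# (user input domain, local domain) -> conversion function.
DOMAIN_CONVERSIONS = {
    ("with-city", "Page-B-City"): city_to_page_b,
    ("with-food", "Page-C-Dishes"): dish_to_page_c,
}


def convert_condition(value, user_domain, local_domain):
    """Convert a condition value only when the two domains differ."""
    if user_domain == local_domain:
        return value
    return DOMAIN_CONVERSIONS[(user_domain, local_domain)](value)


area_for_page_b = convert_condition("Yokohama city", "with-city", "Page-B-City")
genre_for_page_c = convert_condition("Japanese food", "with-food", "Page-C-Dishes")
```

Keeping the conversion functions in a table keyed by domain pair means that a new search engine can be supported by registering new entries rather than changing the conversion logic.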
[0133] At this time, the queries for the search engines and filtering conditions for the retrieval result processing unit 138 are as follows:

Page-A:

Filtering condition: "Area" = "Yokohama city"
SELECT shop name, phone number WHERE genre = "Japanese food" (4-1 = 3-1)

Page-B:

Filtering condition: "Genre" = "Japanese food"
SELECT shop name, phone number WHERE area = "07" (4-2)

In the statement (4-2), the area "Yokohama city" has been changed to "07."

Page-C:

SELECT shop name, phone number FROM Page-C
WHERE area = "Yokohama city" and genre = "1" (4-3)

[0134] In the statement (4-3), the genre "Japanese food" has been changed to "1."
[0135] In step S470 of Fig. 31, the HTML document access unit 14 issues the following queries specific to the search engines according to the query statements prepared in step S460. Thereafter, the search engines carry out retrieval processes.

Page-A:

Filtering condition: "Area" = "Yokohama city"
"GET http://www.Page-a.co.jp/search-shop.cgi?category=Japanese food http/1.0" (5-1)

Page-B:

Filtering condition: "Genre" = "Japanese food"
"GET http://www.Page-b.co.jp/search-shop.cgi?area=07 http/1.0" (5-2)

Page-C:

"GET http://www.Page-c.co.jp/search-shop.cgi?area=Yokohama city & category=1 http/1.0" (5-3)
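Forming the request lines above from converted query conditions can be sketched as follows. The helper name `build_get_request` is hypothetical. Note one deliberate difference from the queries as printed: `urllib.parse.urlencode` percent-encodes the parameters and writes spaces as `+`, so "Yokohama city" becomes `Yokohama+city`.

```python
from urllib.parse import urlencode


def build_get_request(base_url, params):
    """Build a GET request line from a CGI URL and its query parameters."""
    query = urlencode(params)  # encodes reserved characters; spaces become '+'
    return f"GET {base_url}?{query} HTTP/1.0"


request_b = build_get_request(
    "http://www.Page-b.co.jp/search-shop.cgi", {"area": "07"}
)
request_c = build_get_request(
    "http://www.Page-c.co.jp/search-shop.cgi",
    {"area": "Yokohama city", "category": "1"},
)
```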

[0136] In step S475, the search engines return data retrieved from HTML documents, and the retrieval result processing unit 138 extracts necessary information therefrom according to the HTML document to table mapping table 152. Figure 37A shows a display on a browser of the HTML document returned by the search engine of the Page-B, and Fig. 37B shows an HTML description corresponding to the display of Fig. 37A. Retrieval results provided by the search engines are as follows:

(a) Page name: Page-A

Filtering condition: "Area" = "Yokohama city"
Retrieval result:

Shop name: A1, Area: Yokohama city
Phone number: (045) ***-****
Shop name: A2, Area: Yokosuka city
Phone number: (0468) **-**** (6-1)

(b) Page name: Page-B

Filtering condition: "Genre" = "Japanese food"
Retrieval result:

Shop name: B1, Genre: Japanese food
Phone number: 045-***-****
Shop name: B2, Genre: Chinese food
Phone number: 045-***-****
Shop name: B3, Genre: Chinese food
Phone number: 045-***-**** (6-2)

(c) Page name: Page-C

Filtering condition: none
Retrieval result:

Shop name: C1, Phone number: 045-***-****
Shop name: C2, Phone number: 045-***-**** (6-3)

[0137] In step S480, the retrieval result processing unit 138 finds any item that needs a filtering process according to the retrieval pattern matrix of Fig. 30. In step S490, the retrieval result processing unit 138 carries out the filtering process on the retrieval result of each search engine. In the example, the Page-A ignores the condition "Area" = "Yokohama city" and the Page-B ignores the condition "Genre" = "Japanese food." Accordingly, these retrieval results are filtered to extract data that satisfies "Area" = "Yokohama city" and "Genre" = "Japanese food" as follows:

(a) Page name: Page-A

Filtering result

Shop name: A1, Phone number: (045) ***-**** (7-1)

(b) Page name: Page-B

Filtering result

Shop name: B1, Phone number: 045-***-**** (7-2)

(c) Page name: Page-C

Filtering result

Shop name: C1, Phone number: 045-***-****
Shop name: C2, Phone number: 045-***-**** (7-3 = 6-3)



HTML documents returned from a plurality of search engines differ from one another in document structure, presentation style, input form, etc., and therefore the search engines return results in various ways. The second embodiment resolves these differences and provides a user with a search result in an integrated form, regardless of the differences among the search engines. The second embodiment improves search efficiency and reduces traffic in the networks. The second embodiment individually registers and manages the input forms of various search engines and easily controls meta data about HTML documents related to the search engines.
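The filtering of steps S480 and S490 described above can be sketched as follows. This is a minimal illustration in which each retrieval result is a list of dictionaries; the function name `filter_records` and the record layout are assumptions, and the sample records carry only the fields named in results (6-1) and (6-2).

```python
def filter_records(records, filter_conditions):
    """Keep only the records that satisfy every filtering condition."""
    return [
        record
        for record in records
        if all(record.get(item) == value for item, value in filter_conditions.items())
    ]


page_a_result = [
    {"Shop name": "A1", "Area": "Yokohama city"},
    {"Shop name": "A2", "Area": "Yokosuka city"},
]
page_b_result = [
    {"Shop name": "B1", "Genre": "Japanese food"},
    {"Shop name": "B2", "Genre": "Chinese food"},
    {"Shop name": "B3", "Genre": "Chinese food"},
]

page_a_filtered = filter_records(page_a_result, {"Area": "Yokohama city"})
page_b_filtered = filter_records(page_b_result, {"Genre": "Japanese food"})
```

An empty set of filtering conditions, as for the Page-C, leaves the retrieval result unchanged, which corresponds to (7-3 = 6-3).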



[0138] In step S500, the retrieval result conversion unit 135 acquires, from the tables 153, 154 and 155, the user output domains and local domains for the specified search items whose local domain is stipulated, as shown in Fig. 38. For any item whose user output domain differs from its local domain, the retrieval result conversion unit 135 converts the local domain into the corresponding user output domain according to a conversion function fetched from the domain conversion function table 156. For example, the item "Phone number" of the Page-A has a local domain and a user output domain that are identical to each other, and therefore, no conversion is carried out. The item "Phone number" of each of the Page-B and Page-C has a local domain "Tel-Bar" and a user output domain "Tel-Paren." As a result, the retrieval result conversion unit 135 fetches a conversion function "Bar2Paren()" from the domain conversion function table 156 to convert "045-***-****" into "(045) ***-****." The local domains of Page-B and Page-C are converted into user output domains as follows:

Input: "045-***-****" (Domain: Tel-Bar)
Domain conversion function: Bar2Paren()
Output: "(045) ***-****" (Domain: Tel-Paren)
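A conversion from the Tel-Bar style to the Tel-Paren style can be sketched as follows. The function name `bar_to_paren` and the regular expression are assumptions, since the description does not give the body of Bar2Paren(); the phone number used in the usage line is a made-up placeholder, because the description masks the digits.

```python
import re

# Tel-Bar style: three groups of digits separated by hyphens.
_TEL_BAR = re.compile(r"^(\d+)-(\d+)-(\d+)$")


def bar_to_paren(phone):
    """Convert 'AAA-BBB-CCCC' (Tel-Bar) into '(AAA) BBB-CCCC' (Tel-Paren)."""
    match = _TEL_BAR.match(phone)
    if match is None:
        raise ValueError(f"not in Tel-Bar style: {phone!r}")
    area, exchange, line = match.groups()
    return f"({area}) {exchange}-{line}"


converted = bar_to_paren("045-123-4567")  # placeholder number
```

Rejecting input that is not in the expected local domain, rather than passing it through, makes a mismatch between the registered domain and the actual page content visible.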

[0139] In step S510, the user interface unit 11 returns to the user the following collective search result prepared from the above-mentioned retrieval results, and the application program 3 of the user displays the result in the form of, for example, a table.

Shop name: A1, Phone number: (045) ***-****
Shop name: B1, Phone number: (045) ***-****
Shop name: C1, Phone number: (045) ***-****
Shop name: C2, Phone number: (045) ***-****

[0140] As explained above, the second embodiment prepares search requests for a plurality of search engines scattered over open networks by individually managing the objects of the input forms of the search engines, thereby resolving differences among the interfaces of the search engines and flexibly retrieving necessary information through the search engines.



Third embodiment

[0141] An HTML document information extraction apparatus of the third embodiment according to the present invention concerning a semi-structured document information retrieval scheme will be explained with reference to Figs. 39 to 53.

[0142] The third embodiment retrieves information item by item from HTML documents scattered over open networks. The third embodiment is a modification of the first embodiment in which the HTML document processing unit 134 of the first embodiment of Fig. 5 is formed with a template analysis unit 1341, a URL-template table 1342, and a template processing unit 1343. The arrangement of Fig. 39 may be realized alone or may properly be combined with the arrangements of the first and second embodiments. For example, the arrangement of Fig. 39 may have the syntax analysis unit 12, item finding unit 131, query conversion unit 132, HTML document meta data storing unit 15, HTML document meta data manager 16, etc., of Figs. 5 and 17.
[0143] To extract information item by item from HTML documents, the third embodiment manages the locations and document structures of HTML documents for each HTML document. More precisely, the third embodiment manages the locations of HTML documents by using URLs of the HTML documents. Proxy information may be managed by using a proxy setting file 141 that stores proxy server names and proxy port numbers related to the HTML documents. The document structures of HTML documents include information on partial structures such as tables, lists and clauses contained in the HTML documents, that is, information on how the items to be extracted are delimited by delimiters such as tags and slashes. The document structure information includes the attributes of columns and the data type of each item. The third embodiment stores and manages these document structures of HTML documents in template files 1345 as item names, extraction text specifying parts, data types of the items, etc. The data type of a given item may be a character or a numeric value and is used when processing data related to the item. The URL-template table 1342 relates the template files 1345 to the URLs or file names of HTML documents to be searched. Each HTML document is converted into a unified form such as a table according to the extraction text specifying parts of a corresponding template file. The template files 1345 correspond to the HTML document to table mapping table 152 and HTML document item table 153 of Figs. 6 and 18.

[0144] When a user specifies a URL or a file name, the third embodiment refers to the proxy setting file 141, URL-template table 1342, and template files 1345. For example, if a user specifies a URL, the third embodiment refers to the proxy setting file 141 to acquire a corresponding HTML document name, refers to the URL-template table 1342 to acquire a template file name, scans the acquired HTML document one line or plural lines at a time from the top thereof, compares the scanned contents with the extraction text specifying parts of the template file 1345, and extracts information item by item accordingly. At this time, the third embodiment checks to see if there is a link to the next page in the template file 1345. If there is, the third embodiment acquires the URL or file name of the next page and extracts data from the page. The third embodiment repeats these operations until all links have been followed. The third embodiment maps the extracted information to a table item by item by referring to the template file 1345, shapes the information according to data types stipulated in the template file 1345, and returns to the user the names of the items from which the information has been extracted together with the shaped and itemized information. Unlike the prior art, the third embodiment can define the data types of elements (information pieces) extracted from HTML documents so that it can conditionally process the information pieces according to search conditions. Similar to the first and second embodiments, the third embodiment is capable of processing the presentation styles of information according to a user's request.
[0145] Fig. 39 is a block diagram showing the HTML document information extraction apparatus according to the third embodiment.

[0146] In Fig. 39, the apparatus 100 of the third embodiment has a user interface unit 11, an HTML document access unit 14, the proxy setting file 141, an HTML document processing unit 134, the template files 1345, and a retrieval result conversion unit 135. The HTML document processing unit 134 has the template analysis unit 1341, URL-template table 1342, and template processing unit 1343. A user enters a query statement 301 through an application program 3. According to the query statement 301, the apparatus 100 accesses HTML documents directly or through a proxy server 2, acquires information from the HTML documents, processes the information according to template files 1345, and returns a search result 302 to the user.
[0147] HTML documents are scattered over networks and have different locations, tags, and information elements. To cope with these differences and extract information item by item from them, the apparatus 100 individually manages the locations and document structures of the HTML documents for each HTML document. In addition, the apparatus 100 provides a search result in a unified form such as a table.

[0148] The user interface unit 11 receives the query statement 301 entered by the user through the application program 3 and transmits it to the HTML document access unit 14. According to a URL or a file name provided by the user interface unit 11, the HTML document access unit 14 refers to the proxy setting file 141 and acquires an HTML document (4-1, 4-2). The HTML document is transferred to the template analysis unit 1341. If the HTML document contains link data, the template analysis unit 1341 extracts linked URLs, according to which the HTML document access unit 14 refers to the proxy setting file 141 if necessary and acquires HTML documents (4-1, 4-2) having the linked URLs. Figure 41 shows an example of the proxy setting file 141, which specifies proxy server names and proxy port numbers, that is, the location data of the proxy servers necessary for acquiring HTML documents, and which is referred to by the HTML document access unit 14. Figure 42 shows an example of one of the template files 1345, which specifies, in extraction text specifying parts, the parts that are extractable as items and the items to be extracted. The template file also specifies the data types of the items to be extracted. The template files 1345 are referred to by the template analysis unit 1341. The URL-template table 1342 shown in Fig. 43 manages relationships between URLs or file names and template files and is referred to by the template analysis unit 1341. The template analysis unit 1341 fetches the name of a template file corresponding to the query statement 301 from the URL-template table 1342. The template analysis unit 1341 then refers to the template file 1345 having the acquired name and analyzes it to acquire the extractable parts, the items to be extracted, and the data types of the items to be extracted for the queried HTML document. The acquired data is transferred from the template analysis unit 1341 to the template processing unit 1343. The template analysis unit 1341 also determines whether or not there are linked URLs in the template file 1345. If there are linked URLs, they are transferred to the HTML document access unit 14, which acquires linked HTML documents accordingly. According to the extractable parts, the items to be extracted, and the data types of the items to be extracted received from the template analysis unit 1341, the template processing unit 1343 extracts item data from the HTML documents. The retrieval result conversion unit 135 receives the extracted information and the data types thereof from the template processing unit 1343 and carries out conversion on the extracted information according to the data types. The converted information is sent as a search result 302 to the user through the user interface unit 11.

[0149] The apparatus 100 of the third embodiment, or any one of the apparatuses of the first and second embodiments, may be realized with a computer having a CPU, memories, I/O devices, external storage devices, etc., and a medium for recording a program that provides the functions of the present invention when being read by the computer.
[0150] The proxy server 2 acts as an intermediary to acquire HTML documents specifiable by the apparatus 100 and returns an HTML document (4-1, 4-2) specified by a URL to the apparatus 100. The HTML documents 4-1 and 4-2 are tagged text files constituting home pages scattered over open networks. The application program 3 receives from a user a search request containing at least a URL or file name and search items, gets a search result for the search request from the apparatus 100, and provides the user with the search result.
[0151] Processing steps carried out by the apparatus 100 of the third embodiment will be explained. The steps are divided into a preparatory phase of Fig. 40, which prepares data such as presentation styles before retrieval, and an execution phase of Fig. 44. The preparatory phase of Fig. 40 is carried out by an administrator with the use of, for example, an editor, and does not require operating the whole of the apparatus 100.

(1) Preparatory phase 

[0152] The preparatory phase of Fig. 40 will be explained. If a proxy server is needed (S600Y), step S605 sets a proxy server name and a proxy port number to form the proxy setting file 141 of Fig. 41. Step S610 prepares a template file. The template file has a unique name among all template files and contains the following data (Fig. 42):

(a) Items to be extracted 

[0153] Information about items to be extracted corresponds to the keyword "Word."

[0154] The template file stipulates the names of items from which information pieces are extracted, the data types of the items, and fixed values added to the items. In the example of Fig. 42, the data type is "1" to indicate a character type. Note that the data type may be set according to the desired filtering processing, such as "3" for a numeric value type or "4" for a character string adding type. The template file of Fig. 42 includes a linked address (a relative URL path) at the portion headed "Next URL." The data types and fixed values are needed when adding or deleting information with respect to a search result to be returned to a user.

(b) Text extraction specifying part 

[0155] Information about text to be extracted corresponds to the portion headed "HTML Template."

[0156] A record that contains information to be extracted is copied from a target HTML document (Web page). A required information part is replaced with "$item name$," and each part in the record that can be omitted is replaced with an omit mark "..".

[0157] If a given HTML document includes partial structures to be handled, character strings specifying the ends of the partial structures, such as tables, are set. In the example of Fig. 42, there are first, second and third tables and related items.

[0158] If there is any linked URL, a character string for specifying the linked URL is set. Thereafter, step S620 prepares the URL-template table 1342 containing URLs or file names and corresponding template file names, as shown in Fig. 43.

(2) Execution phase 

[0159] Figure 44 shows steps in the execution phase for extracting information from items of a given HTML document according to the third embodiment.

[0160] In step S700, the user interface unit 11 receives a query statement entered by a user through the application program 3. The query statement includes a URL or a file name and search items. If the query statement includes a URL, the HTML document access unit 14 refers to the proxy setting file 141, if the file 141 is defined (4-1), and acquires an HTML document having the URL. If the query statement contains a file name, a local HTML document having the file name is specified. According to the URL or file name and the contents of the proxy setting file 141, the HTML document access unit 14 requests the HTML document directly or through the proxy server 2 and receives the corresponding HTML document in step S710.
[0161] In step S720, the template analysis unit 1341 checks to see if there is a template file 1345 corresponding to the URL. Namely, the template analysis unit 1341 searches the URL-template table 1342 for the URL or file name stipulated in the query statement. If there is no corresponding template file (Step S720N), the template analysis unit 1341 sends an error message to the user interface unit 11. If there is a corresponding template file, the template analysis unit 1341 fetches the template file from among the template files 1345, analyzes extraction rules stipulated in the template file, and transfers the extraction rules to the template processing unit 1343, in step S730.
[0162] In step S740, the template processing unit 1343 extracts information item by item from the HTML document (4-1, 4-2) according to the extraction rules obtained from the template file 1345 and stores the extracted information in a table. In step S750, the template processing unit 1343 analyzes the extraction rules and determines whether or not there is a linked URL. If there is (Step S750Y), the template processing unit 1343 transfers the linked URL to the HTML document access unit 14, which acquires an HTML document having the linked URL. The acquired HTML document with the linked URL is subjected to the steps S730 to S750.
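The loop of steps S740 and S750, in which records are extracted and linked pages are followed, can be sketched as follows. This is a simplified illustration under several assumptions: the extraction rules are represented as two precompiled regular expressions rather than as a template file, the function `extract_all` and the `fetch` callback are hypothetical names, and the pages are held in memory at made-up `example.com` URLs instead of being acquired through the HTML document access unit 14.

```python
import re
from urllib.parse import urljoin


def extract_all(start_url, fetch, record_pattern, next_pattern, max_pages=100):
    """Extract records page by page, following the next-page link if present."""
    rows = []
    seen = set()
    url = start_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)                       # guard against link cycles
        html = fetch(url)
        for match in record_pattern.finditer(html):
            rows.append(match.groupdict())  # one table row per record
        link = next_pattern.search(html)
        # Resolve a relative link such as "./html_2.html" against the page URL.
        url = urljoin(url, link.group("url")) if link else None
    return rows


# In-memory pages standing in for documents fetched over the network.
PAGES = {
    "http://example.com/html_1.html": (
        "<TR><TD>Shop A</TD><TD>045-111-1111</TD></TR>\n"
        '<A HREF="./html_2.html">next</A>'
    ),
    "http://example.com/html_2.html": "<TR><TD>Shop B</TD><TD>045-222-2222</TD></TR>",
}

record_pattern = re.compile(
    r"<TR><TD>(?P<shop_name>[^<]*)</TD><TD>(?P<phone_number>[^<]*)</TD></TR>"
)
next_pattern = re.compile(r'<A HREF="(?P<url>[^"]+)">')

rows = extract_all(
    "http://example.com/html_1.html", PAGES.__getitem__, record_pattern, next_pattern
)
```

The set of visited URLs and the page limit are additions for safety that the description does not mention; they prevent an endless loop when pages link to each other.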
[0163] The retrieval result conversion unit 135 refers to the template file 1345 to carry out the following processes on the extracted items of information:

a) executing no processes on item data whose data type is ruled to display information as it is;

b) returning fixed values from the retrieval result conversion unit 135 for items whose data type is ruled to have the fixed values even if the HTML document contains no corresponding information;

c) deleting commas from numeric values for item data whose data type is ruled to do so; and

d) adding fixed values such as relative URL paths to item data whose data type is ruled to have such additional values.
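The four processes a) to d) can be sketched as a dispatch on the data type of each item. The enumeration `DataType` and the function `convert_item` are hypothetical; the description identifies data types by codes in the template file, and the names below do not reproduce those codes. Process d) is assumed here to prepend the fixed value, which the description does not specify.

```python
from enum import Enum


class DataType(Enum):
    AS_IS = "as_is"          # a) display the information as it is
    FIXED = "fixed"          # b) always return the fixed value
    NUMERIC = "numeric"      # c) delete commas from numeric values
    PREFIXED = "prefixed"    # d) add a fixed value such as a relative URL path


def convert_item(value, data_type, fixed_value=""):
    """Shape one extracted item according to its data type."""
    if data_type is DataType.AS_IS:
        return value
    if data_type is DataType.FIXED:
        return fixed_value
    if data_type is DataType.NUMERIC:
        return value.replace(",", "")
    if data_type is DataType.PREFIXED:
        return fixed_value + value   # assumption: the fixed value is a prefix
    raise ValueError(f"unknown data type: {data_type!r}")


price = convert_item("1,200", DataType.NUMERIC)
link = convert_item("shop_1.html", DataType.PREFIXED, fixed_value="./shops/")
```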

[0164] According to these pieces of data, the retrieval result conversion unit 135 prepares a search result and transmits it to the application program 3 through the user interface unit 11.

[0165] Figures 45 to 48 show examples of extracting information item by item according to the third embodiment, in which Fig. 45 is a display of an HTML document on a Web browser, Fig. 46 is a part of the HTML description corresponding to the display of Fig. 45, and Fig. 47 shows a template file for extracting information item by item from the HTML document of Figs. 45 and 46. The template file includes items to be extracted, i.e., "racename," "grade," "circle," "mmdd," "distance," "condition," "time," "winhorse," "sex_age," "jockey," "teki (trainer)," and "url." The template file also includes a text extraction specifying part for extracting these items. Figure 48 shows an example of information extraction from the HTML document of Figs. 45 and 46 according to the template file of Fig. 47. This example assumes that the application program 3 specifies or selects "jockey," "winhorse," and "racename" as search items.
[0166] Figs. 42 and 49 to 52 show a modification of the third embodiment. The template file of Fig. 42 of the third embodiment contains the first and second tables, which are partial structures consisting of the same elements for the same HTML document. Here, a partial structure is a data group related to one subject, such as a table, list or clause. The modification, on the other hand, extracts required information item by item by employing a template file that contains items having different attributes for the same HTML document, or a template file that contains partial structures having different elements for the same HTML document, or a template file that is applicable to an HTML document including link information.

[0167] Figs. 49 and 50 show examples of displays on a Web browser of HTML documents showing shop information. Each of these HTML documents has three tables having the same structure. Figure 51 shows an HTML description corresponding to the HTML document of Fig. 49, and Fig. 52 shows an HTML description corresponding to the HTML document of Fig. 50. Fig. 42 shows a template file for extracting information item by item from the HTML documents of Figs. 49 to 52. The template file of Fig. 42 contains "TableEndDelimiter" to indicate the end of a partial structure such as a table, list or clause, the names of items to be extracted in words, data types of the items in words, and a text extraction specifying part "HtmlTemplate." For example, "TableEndDelimiter" set to </TABLE> indicates that an appearance of </TABLE> specifies the end of a partial structure.
[0168] <A HREF = "./html_2.html"> in Fig. 51 indicates a link to the HTML document of Fig. 52. The template analysis unit 1341 analyzes this link information. According to the link information and "NextURL" in the template file of Fig. 42, the template processing unit 1343 extracts information not only from the items of the HTML document of Fig. 49 but also from the items of the HTML document of Fig. 50.

[0169] First and second tables in the HTML description of Fig. 51 are two partial structures having the same document structure and the same data types. According to the descriptions about the first and second structures in the template file of Fig. 42, the template processing unit 1343 extracts item data in the partial structures having the same structure in the same HTML document. The HTML description of Fig. 52 has the same structure as that of Fig. 51, and therefore, information is extracted item by item therefrom according to the template file of Fig. 42.

[0170] The first and second tables in the HTML document of Fig. 51 are two partial structures having different attributes, in particular, presentation attributes. Among the information pieces in the item "Genre" in the HTML document of Fig. 51, some are delimited with <I> and </I> and some are not. The tags <I> and </I> indicate that the enclosed information piece is displayed in italic. Similarly, the tags <B> and </B> indicate that the enclosed information piece is displayed in bold. In the template file of Fig. 42, the information for these different attributes is defined with two descriptions, which are applied to one line of a corresponding partial structure of the HTML documents. If a given HTML document agrees with one of the descriptions, item information is extracted from the corresponding HTML document. In Fig. 42, an omission tag ".." is used for the item "Genre" to extract information pieces from the item without regard to the presentation attribute thereof.
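One way an omission tag could let a single template line match an item with or without presentation tags is sketched below. The template syntax is an assumption: item placeholders are written as `$name$` and the omission tag as `..`, and the function `template_to_regex` is hypothetical. The sketch treats the omission tag as "zero or more HTML tags," which is one possible reading; the template file of Fig. 42 may define it differently.

```python
import re

# Split a template line into literal text, "$name$" placeholders, and ".." marks.
_TOKEN = re.compile(r"(\$\w+\$|\.\.)")


def template_to_regex(template_line):
    """Compile a template line into a regular expression with named groups."""
    parts = []
    for token in _TOKEN.split(template_line):
        if token == "..":
            parts.append(r"(?:<[^>]*>)*")            # omission: any run of tags
        elif len(token) > 2 and token.startswith("$") and token.endswith("$"):
            parts.append(f"(?P<{token[1:-1]}>[^<]*)")  # item text up to next tag
        else:
            parts.append(re.escape(token))            # literal record text
    return re.compile("".join(parts))


pattern = template_to_regex("<TD>..$genre$..</TD>")
italic = pattern.search("<TD><I>Japanese food</I></TD>")
plain = pattern.search("<TD>Chinese food</TD>")
```

With this reading, one description covers both the italic and the plain variant of the "Genre" cell, so the template does not need a separate line per presentation attribute. Placeholder names must be valid Python group names for the pattern to compile.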

[0171] In Fig. 51, a third table is a partial structure having an element "Evaluation" that is not in the first and second tables. A description about the third table in Fig. 53 enables the template processing unit 1343 to extract partial structures having different elements in the same HTML document.

[0172] As explained above, the third embodiment manages data about information contained in plural HTML documents, extracts information item by item from the HTML documents according to the data, and provides a user with required information in a unified form such as a table. The third embodiment prepares a text extraction specifying part to specify only the items from which information must be extracted according to a user's request, thereby making the formation and maintenance of the retrieval system easier. The third embodiment retrieves information item by item from HTML documents scattered over open networks without regard to the varying interfaces attached to the HTML documents, and provides each user with required information in a required form.

[0173] The third embodiment employs template files that are independent of HTML syntax rules to extract required information item by item from HTML documents, provided that the HTML documents have items delimited with, for example, tags. The third embodiment extracts information item by item from HTML documents only by preparing template files that define the items from which information is extracted. The template files can easily be prepared according to target HTML documents and are visually understandable. Consequently, the third embodiment easily and flexibly extracts information item by item from HTML documents.
[0174] It is to be noted that, besides those already mentioned above, many modifications and variations of the above embodiments may be made without departing from the novel and advantageous features of the present invention. Accordingly, all such modifications and variations are intended to be included within the scope of the appended claims.

Claims 

1. An apparatus for retrieving data contained in a plurality of semi-structured documents over open networks, comprising:

a unit (15) for storing meta data for each of the semi-structured documents, the meta data including items to be extracted from the semi-structured documents and item data used to conditionally retrieve the items;
a unit (13) for retrieving data scattered among the semi-structured documents for an entered query according to the meta data, and preparing a collective search result; and
a unit (11) for outputting the search result in a prescribed single format that is specific to each user.

2. An apparatus for retrieving data contained in a plurality of semi-structured documents over open networks, comprising:

(a) a unit (15) for storing location data about the location of each of the semi-structured documents, document structure data about the structure of each of the semi-structured documents, used to delimit each document into items to be extracted, attribute data about the attributes of each of the items to be extracted, used to conditionally retrieve the items, and style conversion data used to convert item presentation styles of the user and item presentation styles of the semi-structured documents from one into another;

(b) a unit (131) for finding, according to the location data, the location of a semi-structured document that contains all search items specified in an entered query that consists of the search items and search conditions;

(c) a unit (132) for converting, if necessary, item presentation styles of the entered query into item presentation styles of the search items in the location found semi-structured documents according to the style conversion data, and forming queries for the location found semi-structured documents;

(d) a unit (14) for transmitting the queries provided by the unit (c) to the found locations and acquiring the semi-structured documents;

(e) a unit (134) for extracting item data from the acquired semi-structured documents according to the document structure data, selecting the extracted item data, if necessary, according to the attribute data for the search condition, and preparing a search result; and

(f) a unit (135) for converting, if necessary, item presentation styles of the search result into the item presentation styles of each user according to the style conversion data.

3. The apparatus of claim 2, further comprising:

(g) a unit (1345) for storing, for each of the semi-structured documents, a template that stipulates at least item names to be extracted and prescribed text extraction style data of an item group to be extracted according to the document structure data,
wherein the unit (e) compares the acquired semi-structured document with corresponding templates by scanning the acquired semi-structured document; and
extracts item data of the items matching the text extraction style data of the template so as to prepare the search result.

4. The apparatus of claim 3, wherein: 

the unit (e) shapes the search result into a
table.

5. The apparatus of claim 3, wherein, if the text extrac-
tion style data of a given template includes link data
to another semi-structured document:

the unit (e) scans a linked semi-structured doc-
ument and compares the linked semi-struc-
tured document with the template.

6. The apparatus of claim 3, wherein:

any template that is for a semi-structured docu-
ment having a plurality of partial structures of
the same structure contains text extraction
style data for each of the partial structures; and
the unit (e) extracts the item data so as to pre-
pare the search result for each of the partial
structures.

7. The apparatus of claim 3, wherein: 

the template contains a plurality of pieces of text
extraction style data for each of partial struc-
tures, the text extraction style data being used
for filtering uneven parts contained in the par-
tial structure; and

the unit (e) extracts item data of the items matching
the text extraction style data, by scanning the
acquired semi-structured document when the
partial structure of the semi-structured docu-
ment matches any one piece of the text extraction
style data.
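
As an illustration of a template that holds several pieces of text extraction style data, the following Python sketch tries each pattern in turn and keeps only the partial structures that match one of them. Expressing the extraction style data as regular expressions is an assumption made for this sketch, and the two-field list entry and the `<P>` line are hypothetical inputs.

```python
import re

# Hypothetical template: two alternative extraction styles for list entries.
# The first expects "maker/product/price", the second "product/price".
PATTERNS = [
    re.compile(r"<LI>(?P<maker>[^/<]+)/(?P<product>[^/<]+)/(?P<price>[^/<]+)"),
    re.compile(r"<LI>(?P<product>[^/<]+)/(?P<price>[^/<]+)"),
]


def extract_items(lines, patterns=PATTERNS):
    """Return item data for every line that matches any one pattern.

    Lines that match none of the patterns are treated as uneven parts
    and are filtered out.
    """
    rows = []
    for line in lines:
        for pattern in patterns:
            match = pattern.fullmatch(line.strip())
            if match:
                rows.append(match.groupdict())
                break  # the first matching extraction style wins
    return rows


SAMPLE = [
    "<LI>Maker A/PC1/168,000YEN",
    "<LI>PC5/99,000YEN",          # hypothetical entry without a maker name
    "<P>Sale ends soon",          # hypothetical uneven part
]
rows = extract_items(SAMPLE)
```

Ordering the patterns from most to least specific keeps a three-field entry from being claimed by the two-field pattern.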

8. The apparatus of claim 3, wherein:

any template that is for a semi-structured docu-
ment having a plurality of partial structures
containing mutually different elements contains
text extraction style data for each of the partial
structures; and

the unit (e) extracts the item data so as to pre-
pare the search result for each of the partial
structures.

9. An apparatus for retrieving data through search 
engines over open networks, comprising: 

(aa) a unit (150) for storing location data about
the location of each search engine, essential
input item data specifying essential input items
required by an input form of each search
engine, document structure data about the
structure of each HTML document, used to
delimit document into items to be extracted,
attribute data about the attributes of the items
to be extracted, used to conditionally retrieve
the items, and style conversion data used to
convert item presentation styles of a user and
item presentation styles of each HTML docu-
ment from one into another;
(bb) a unit (131) for finding, according to the
location data, the location of a search engine
that contains all search items specified in an
entered query that consists of the search items
and search conditions;



(cc) a unit (136) for selecting, according to the 
essential input item data, search engine to be 
searched from among the location found 
search engines, the search engine of which the 
essential input item satisfy the specified search 
condition; 

(dd) a unit (137) for determining an optimum 
retrieval pattern for each of the selected search 
engines according to a matrix table and con- 
verting the entered query into queries for the 
selected search engines accordingly, the 
matrix table defining combination between the 
search items and search conditions and the
items and essential input items of each search 
engine; 

(ee) a unit (132) for converting, if necessary, 
item presentation styles of the queries provided 
by the unit (dd) into item presentation styles of
the search item in selected search engines 
according to the style conversion data; 
(ff) a unit (14) for transmitting the queries pro- 
vided by the unit (ee) to the found locations and 
acquiring HTML documents; 
(gg) a unit (138) for extracting item data from 
the acquired HTML document serving as a first 
search result according to the structure data, 
selecting the extracted item data, if necessary, 
according to the attribute data for the search 
condition on the basis of corresponding 
retrieval pattern and preparing a second 
search result; and 

(hh) a unit (135) for converting, if necessary, 
item presentation styles of the second search 
result into item presentation styles of each user 
according to the style conversion data. 
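
The selection performed by units (bb) and (cc) can be sketched as follows. This is a hedged reading of the claim, not the specification's own procedure: an engine is assumed to qualify when every essential input item of its form is supplied as a search condition, and the `EngineMeta` record, the item names, and the second engine are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineMeta:
    """Hypothetical record of the data that unit (aa) stores for one engine."""
    location: str
    items: frozenset              # items assumed to appear in its result pages
    essential_inputs: frozenset   # inputs that its input form requires


def find_engines(engines, search_items):
    """Unit (bb): keep engines that contain all search items."""
    wanted = set(search_items)
    return [e for e in engines if wanted <= e.items]


def select_engines(candidates, conditions):
    """Unit (cc): keep engines whose essential inputs are all supplied."""
    supplied = set(conditions)
    return [e for e in candidates if e.essential_inputs <= supplied]


ENGINES = [
    EngineMeta(
        "http://www.page_b.co.jp/cgi-bin/search.cgi",
        frozenset({"shop name", "area"}),
        frozenset({"area"}),
    ),
    EngineMeta(
        "http://engine-c.example/search",      # hypothetical engine
        frozenset({"shop name", "area", "phone"}),
        frozenset({"phone"}),
    ),
]

query_items = ["shop name", "area"]
query_conditions = {"area": "YOKOHAMA-SHI"}
selected = select_engines(find_engines(ENGINES, query_items), query_conditions)
```

Both engines offer the requested items, but the second is dropped because the query supplies no value for its essential input.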

10. The apparatus of claim 9, further comprising:

(ii) a unit (1345) for storing, for each HTML doc- 
ument, a template that stipulates at least item 
name to be extracted and prescribed text 
extraction style data of item group to be 
extracted according to the document structure 
data, 

wherein the unit (gg) compares the acquired
HTML document with the corresponding tem-
plate by scanning the acquired HTML docu-
ment serving as the first search result; and
extracts item data of the items matching the 
text extraction style data of the template so as 
to prepare the second search result. 

11. The apparatus of claim 10, wherein:

the unit (gg) shapes the search result into a 
table. 

12. The apparatus of claim 10, wherein, if the text
extraction style data of a given template includes
link data to another HTML document:

the unit (gg) scans a linked HTML document
and compares the linked HTML document with
the template.



13. The apparatus of claim 10, wherein:

any template that is for an HTML document
having a plurality of partial structures of the
same structure contains text extraction style
data for each of the partial structures; and
the unit (gg) extracts the item data so as to pre-
pare the search result for each of the partial
structures.

14. The apparatus of claim 10, wherein:

the template contains a plurality of pieces of text
extraction style data for each of partial struc-
tures, the text extraction style data being used
for filtering uneven parts contained in the par-
tial structure; and

the unit (gg) extracts item data of the items
matching the text extraction style data, by
scanning the acquired HTML document when
the partial structure of the HTML document
matches any one piece of the text extraction style
data.

15. The apparatus of claim 10, wherein:

any template that is for an HTML document
having a plurality of partial structures contain-
ing mutually different elements contains text
extraction style data for each of the partial
structures; and

the unit (gg) extracts the item data so as to pre-
pare the search result for each of the partial
structures.

16. An apparatus for extracting data item by item from
arbitrary HTML document over open networks,
comprising:

(aaa) a unit (1345) for storing a template for
each HTML document according to document
structure data about the structure of the HTML
document used to delimit document into items
to be extracted, the template stipulating at least
item name to be extracted and prescribed text
extraction style data of item group to be
extracted from the HTML document;
(bbb) a unit (1341) for analyzing a template cor-
responding to acquired HTML document; and
(ccc) a unit (1343) for comparing the acquired
HTML documents with corresponding template
by scanning the acquired HTML document,
and extracting item data of the items matching
the text extraction style data of the template, so
as to prepare a search result.

17. The apparatus of claim 16, wherein:

the unit (ccc) shapes the search result into a
table.

18. The apparatus of claim 16, wherein, if the text
extraction style data of a given template includes
link data to another HTML document:

the unit (ccc) scans a linked HTML document
and compares the linked HTML document with
the template.

19. The apparatus of claim 16, wherein:

any template that is for an HTML document
having a plurality of partial structures of the
same structure contains text extraction style
data for each of the partial structures; and
the unit (ccc) extracts the item data so as to
prepare the search result for each of the partial
structures.

20. The apparatus of claim 16, wherein: 

the template contains a plurality of pieces of
text extraction style data for each of partial
structures, the text extraction style data being
used for filtering uneven parts contained in the
partial structure; and

the unit (ccc) extracts item data of the items
matching the text extraction style data, by scanning
the acquired HTML document, when the partial
structure of the HTML document matches any
one piece of the text extraction style data.

21. The apparatus of claim 16, wherein: 

any template that is for an HTML document
having a plurality of partial structures contain-
ing mutually different elements contains text
extraction style data for each of the partial
structures; and

the unit (ccc) extracts the item data so as to
prepare the search result for each of the partial
structures.

22. A method of retrieving data contained in a plurality
of semi-structured documents over open networks,
comprising the steps of:

retrieving data scattered among semi-struc-
tured documents for entered query according
to meta data about each of the semi-structured
documents and preparing a collective search
result, the meta data including items to be
extracted from the semi-structured documents
and item data used to conditionally retrieve the
items; and

outputting the search result in a prescribed sin-
gle format that is specific to each user.

23. A method of retrieving data contained in a plurality
of semi-structured documents over open networks,
comprising the steps of:

(a) finding, according to location data that
specifies the location of each of the semi-struc-
tured documents, the location of a semi-struc-
tured document that contains all search items
specified in an entered query that consists of the
search items and search conditions (S210);

(b) converting, if necessary, item presentation
styles of the entered query into item presenta-
tion styles of the search item in location found
semi-structured documents according to style
conversion data and forming queries for the
location found semi-structured documents, the
style conversion data being used to convert
item presentation styles of a user and item
presentation styles of the semi-structured doc-
uments from one into another (S220, S230);

(c) transmitting the queries provided by the
step (b) to the found locations and acquiring
the semi-structured documents (S240);

(d) extracting item data from the acquired semi-
structured documents according to document
structure data, selecting the extracted item
data, if necessary, according to attribute data
for the search condition and preparing a search
result, the document structure data specifying
the structure of each of the semi-structured
documents and being used to delimit document
into items to be extracted, the attribute data
specifying the attributes of each item to be
extracted and being used to conditionally
retrieve the items (S240); and

(e) converting, if necessary, item presentation
styles of the search result into the item presen-
tation styles of each user according to the style
conversion data (S250).



24. A method of retrieving data through search engines 
over open networks, comprising the steps of: 
(aa) finding, according to location data that
specifies the location of each search engine,
the location of a search engine that contains all
search items specified in an entered query that
consists of the search items and search condi-
tions (S410, S420);



(bb) selecting, according to essential input item 
data that specifies essential input items 
required by an input form of each search 
engine, search engine to be searched from 
among the location found search engines, the 
search engine of which the essential input item 
satisfy the specified search condition (S430); 
(cc) determining an optimum retrieval pattern 
for each of the selected search engines 
according to a matrix table and converting the 
entered query into queries for the selected 
search engines accordingly, the matrix table 
defining combination between the search items 
and search conditions and the items and 
essential input items of each search engine 
(S440); 

(dd) converting, if necessary, item presentation 
styles of the queries provided by the step (cc) 
into item presentation styles of the search item 
in selected search engines according to style 
conversion data that is used to convert item 
presentation styles of a user and item presen- 
tation styles of each HTML document from one 
into another (S450, S460);
(ee) transmitting the queries obtained by the 
step (dd) to the found location and acquiring 
HTML documents (S470); 
(ff) extracting item data from the acquired
HTML document serving as first search result
according to document structure data (S475),
selecting, if necessary, the extracted item data
according to attribute data for the searching
condition on the basis of corresponding
retrieval pattern, and preparing a second
search result (S480, S490), the document
structure data specifying the structure of each
HTML document and being used to delimit doc-
ument into items to be extracted, the attribute
data specifying the attributes of the items to be 
extracted and being used to conditionally 
retrieve the items; and 

(gg) converting, if necessary, item presentation
styles of the second search result into item 
presentation styles of each user according to 
the style conversion data (S500). 

25. A method of extracting data item by item from arbi- 
trary HTML document over open networks, com- 
prising the steps of: 

(aaa) analyzing a template corresponding to 
acquired HTML document, the template for 
each HTML document being set according to 
document structure data that specifies the 
structure of each HTML document and is used 
to delimit document into items to be extracted, 
the template stipulating at least item name to 
be extracted and prescribed text extraction 
style data of item group to be extracted from
the corresponding HTML document (S730);
and

(bbb) comparing the acquired HTML docu-
ments with corresponding template by scan-
ning the acquired HTML document, and
extracting item data of the items matching the
text extraction style data of the template, so as
to prepare a search result (S740, S750).

26. A computer readable recording medium recording a
program for causing the computer to execute
processing for retrieving data contained in a plural-
ity of semi-structured documents over open net-
works, the processing including:

a process for retrieving the data scattered
among semi-structured documents for entered
query according to meta data about each of the
semi-structured documents and preparing a
collective search result, the meta data includ-
ing items to be extracted from the semi-struc-
tured documents and item data used to
conditionally retrieve the items; and
a process for outputting the search result in a
prescribed single format that is specific to each
user.

27. A computer readable recording medium recording a
program for causing the computer to execute
processing for retrieving data involved in a plurality
of semi-structured documents over open networks,
the processing including:

(a) a process (131) for finding, according to
location data that specifies the location of each
of the semi-structured documents, the location
of a semi-structured document that contains all
search items specified in an entered query that con-
sists of the search items and search conditions;

(b) a process (132) for converting, if necessary,
item presentation styles of the entered query
into item presentation styles of the search item
in location found semi-structured documents
according to style conversion data and forming
queries for the location found semi-structured
documents, the style conversion data being
used to convert item presentation styles of a
user and item presentation styles of the semi-
structured documents from one into another;

(c) a process (14) for transmitting the queries
provided by the process (b) to the found loca-
tions and acquiring the semi-structured docu-
ments;

(d) a process (134) for extracting item data
from the acquired semi-structured documents
according to document structure data, select-
ing the extracted item data, if necessary,
according to attribute data for the search condi-
tion and preparing a search result, the docu-
ment structure data specifying the structure of
each of the semi-structured documents and
being used to delimit document into items to be
extracted, the attribute data specifying the
attributes of each item to be extracted and
being used to conditionally retrieve the items;
and

(e) a process (135) for converting, if necessary,
item presentation styles of the search result
into the item presentation styles of each user
according to the style conversion data.

28. The recording medium of claim 27, wherein the
process (d) 

compares the acquired semi-structured docu- 
ment with corresponding template, the tem- 
plate stipulating, for each of the semi- 
structured documents, at least item name to be 
extracted and prescribed text extraction style 
data of item group to be extracted according to 
the document structure data; and 
extracts item data of the items matching the 
text extraction template so as to prepare the 
search result. 

29. The recording medium of claim 28, wherein the 
process (d) shapes the search result into a table. 

30. The recording medium of claim 28, wherein, if the
text extraction style data of a given template 
includes link data to another semi-structured docu- 
ment, the process (d) scans a linked semi-struc- 
tured document and compares the linked semi- 
structured document with the template. 

31. The recording medium of claim 28, wherein: 

any template that is for a semi-structured docu- 
ment having a plurality of partial structures of 
the same structure contains text extraction 
style data for each of the partial structures; and 
the process (d) extracts the item data so as to 
prepare the search result for each of the partial 
structures. 

32. The recording medium of claim 28, wherein: 

the template contains a plurality of pieces of text
extraction style data for each of partial struc-
tures, the text extraction style data being used
for filtering uneven parts contained in the par-
tial structure; and

the process (d) extracts item data of the items
matching the text extraction style data, by
scanning the acquired semi-structured docu-
ment when the partial structure of the semi-
structured document matches any one piece of
the text extraction style data.

33. The recording medium of claim 28, wherein:

any template that is for a semi-structured docu- 
ment having a plurality of partial structures 
containing mutually different elements contains 
text extraction style data for each of the partial
structures; and 

the process (d) extracts the item data so as to 
prepare the search result for each of the partial 
structures. 

34. A computer readable recording medium recording a
program for causing the computer to execute
processing for retrieving data through search engines
over the open networks, the processing including:

(aa) a process (131) for finding, according to
location data that specifies the location of each
search engine, the location of a search engine
that contains all search items specified in an
entered query that consists of the search items
and search conditions;

(bb) a process (136) for selecting, according to
essential input item data that specifies essen-
tial input items required by an input form of
each search engine, search engine to be
searched from among the location found
search engines, the search engine of which the
essential input item satisfy the specified search
condition;

(cc) a process (137) for determining an opti-
mum retrieval pattern for each of the selected
search engines according to a matrix table and
converting the entered query into queries for
the selected search engines accordingly, the
matrix table defining combination between the
search items and search conditions and the
items and essential input items of each search
engine;

(dd) a process (132) for converting, if neces-
sary, item presentation styles of the queries
provided by the process (cc) into item presen-
tation styles of the search item in selected
search engines according to style conversion
data that is used to convert item presentation
styles of a user and item presentation styles of
each HTML document from one into another;
(ee) a process (14) for transmitting the queries
obtained by the process (dd) to the found loca-
tion and acquiring HTML documents;
(ff) a process (138) for extracting item data
from the acquired HTML document serving as
first search result according to document struc-
ture data, selecting, if necessary, the extracted



item data according to attribute data for the 
searching condition on the basis of corre- 
sponding retrieval pattern, and preparing a 
second search result, the document structure 
data specifying the structure of each HTML 
document and being used to delimit document 
into items to be extracted, the attribute data 
specifying the attributes of the items to be 
extracted and being used to conditionally 
retrieve the items; and 

(gg) a process (135) for converting, if neces- 
sary, item presentation styles of the second 
search result into item presentation styles of 
each user according to the style conversion 
data. 

35. The recording medium of claim 34, wherein the
process (ff) 

compares the acquired HTML document with 
corresponding template, the template stipulat- 
ing, for each of HTML documents, at least item 
name to be extracted and prescribed text 
extraction style data of item group to be 
extracted according to the document structure 
data; and 

extracts item data of the items matching the 
text extraction style data of the template so as 
to prepare the search result.

36. The recording medium of claim 35, wherein the
process (ff) shapes the search result into a table. 

37. The recording medium of claim 35, wherein, if the 
text extraction style data of a given template includes link
data to another document, the process (ff) scans a
linked HTML document and compares the linked 
HTML document with the template. 

38. The recording medium of claim 35, wherein: 

any template that is for an HTML document 
having a plurality of partial structures of the 
same structure contains text extraction style 
data for each of the partial structures; and 
the process (ff) extracts the item data so as to 
prepare the search result for each of the partial 
structures. 

39. The recording medium of claim 35, wherein: 

the template contains a plurality of pieces of text
extraction style data for each of partial struc-
tures, the text extraction style data being used
for filtering uneven parts contained in the par-
tial structure; and

the process (ff) extracts item data of the items
matching the text extraction style data, by
scanning the acquired HTML document, when
the partial structure of the HTML document
matches any one piece of the text extraction style
data.

40. The recording medium of claim 35, wherein: 

any template that is for an HTML document 
having a plurality of partial structures contain- 
ing mutually different elements contains text 
extraction style data for each of the partial 
structures; and 

the process (ff) extracts the item data so as to 
prepare the search result for each of the partial 
structures. 

41. A computer readable recording medium recording a 
program for causing the computer to execute 
processing for extracting data item by item from 
arbitrary HTML documents over open networks, the 
processing including: 

(aaa) a process (1341) for analyzing a template
corresponding to acquired HTML document, 
the template for each HTML document being 
set according to document structure data that 
specifies the structure of each HTML docu- 
ment and is used to delimit document into 
items to be extracted, the template stipulating 
at least item name to be extracted and pre- 
scribed text extraction style data of item group 
to be extracted from the corresponding HTML 
document; and 

(bbb) a process (1343) for comparing the
acquired HTML documents with the corresponding
template by scanning the acquired HTML
document, and extracting item data of the 
items matching the text extraction style data of 
the template, so as to prepare a search result. 

42. The recording medium of claim 41, wherein the 
process (bbb) shapes the search result into a table. 
43. The recording medium of claim 41, wherein, if the
text extraction style data of a given template
includes link data to another document, the process
(bbb) scans a linked HTML document and com-
pares the linked HTML document with the template.

44. The recording medium of claim 41, wherein:

any template that is for an HTML document
having a plurality of partial structures of the
same structure contains text extraction style
data for each of the partial structures; and
the process (bbb) extracts the item data so as
to prepare the search result for each of the par-
tial structures.

45. The recording medium of claim 41, wherein:

the template contains a plurality of pieces of text
extraction style data for each of partial struc-
tures, the text extraction style data being used
for filtering uneven parts contained in the par-
tial structure; and

the process (bbb) extracts item data of items
matching the text extraction style data, in par-
tial structures thereof according to the first and
second extraction style data of corresponding
ones of the templates by scanning the obtained
HTML document, when the partial structure of
the HTML document matches any one piece of
the text extraction style data.

46. The recording medium of claim 41, wherein:

any template that is for an HTML document
having a plurality of partial structures contain-
ing mutually different elements contains text
extraction style data for each of the partial
structures; and

the process (bbb) extracts the item data so as
to prepare the search result for each of the par-
tial structures.
FIG.9A

EXEMPLARY DISPLAY BY WEB BROWSER
TITLE: SHOP A PRODUCT INFORMATION
URL: http://www.shop_a.co.jp/products.html

PRODUCT INFORMATION

PRODUCT NAME     PRICE
Maker A/PC1      ¥170,000
Maker A/PC2      ¥238,000
Maker B/PC101    ¥198,000


FIG.9B 



HTML DOCUMENT 



<BODY>
<H1>PRODUCT INFORMATION</H1>
<TABLE BORDER>
<TR><TH>PRODUCT NAME</TH><TH>PRICE</TH></TR>
<TR><TD>Maker A/PC1</TD><TD>¥170,000</TD></TR>
<TR><TD>Maker A/PC2</TD><TD>¥238,000</TD></TR>
<TR><TD>Maker B/PC101</TD><TD>¥198,000</TD></TR>
</TABLE>
</BODY>
FIG.10A

EXEMPLARY DISPLAY BY WEB BROWSER
TITLE: SHOP B PRODUCT INFORMATION
URL: http://www.shop_b.co.jp/shouhin.html

PRODUCT INFORMATION

MAKER NAME/PRODUCT NAME /PRICE

1. Maker A/PC1/168,000YEN

2. Maker B/PC101/208,000YEN

3. Maker B/PC102/248,000YEN



FIG.10B 

HTML DOCUMENT 



<BODY>
<H1>PRODUCT INFORMATION</H1>
<H3>MAKER NAME/PRODUCT NAME /PRICE</H3>
<OL>
<LI>Maker A/PC1/168,000YEN
<LI>Maker B/PC101/208,000YEN
<LI>Maker B/PC102/248,000YEN
</OL>
</BODY>
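
The two shop pages of FIG.9B and FIG.10B present the same items in different structures and price styles. The following Python sketch shows one way per-document templates could extract the items and bring the prices into a single style. The regular-expression templates, the `shop_a`/`shop_b` keys, and the `cheapest` helper are assumptions made for illustration, and the page strings are cleaned-up copies of the two figures.

```python
import re

FIG_9B = """<BODY>
<H1>PRODUCT INFORMATION</H1>
<TABLE BORDER>
<TR><TH>PRODUCT NAME</TH><TH>PRICE</TH></TR>
<TR><TD>Maker A/PC1</TD><TD>¥170,000</TD></TR>
<TR><TD>Maker A/PC2</TD><TD>¥238,000</TD></TR>
<TR><TD>Maker B/PC101</TD><TD>¥198,000</TD></TR>
</TABLE>
</BODY>"""

FIG_10B = """<BODY>
<H1>PRODUCT INFORMATION</H1>
<H3>MAKER NAME/PRODUCT NAME /PRICE</H3>
<OL>
<LI>Maker A/PC1/168,000YEN
<LI>Maker B/PC101/208,000YEN
<LI>Maker B/PC102/248,000YEN
</OL>
</BODY>"""

# Hypothetical templates: one extraction pattern per document structure.
TEMPLATES = {
    "shop_a": re.compile(
        r"<TR><TD>(?P<maker>[^/<]+)/(?P<product>[^<]+)</TD>"
        r"<TD>¥(?P<price>[\d,]+)</TD></TR>"
    ),
    "shop_b": re.compile(
        r"<LI>(?P<maker>[^/<]+)/(?P<product>[^/<]+)/(?P<price>[\d,]+)YEN"
    ),
}


def extract(shop, document):
    """Scan the document with the shop's template and return table rows."""
    rows = []
    for m in TEMPLATES[shop].finditer(document):
        rows.append({
            "shop": shop,
            "maker": m["maker"].strip(),
            "product": m["product"].strip(),
            # Style conversion: "¥170,000" and "168,000YEN" both become ints.
            "price_yen": int(m["price"].replace(",", "")),
        })
    return rows


def cheapest(rows, product):
    """Conditional retrieval over the merged table: lowest-priced offer."""
    offers = [r for r in rows if r["product"] == product]
    return min(offers, key=lambda r: r["price_yen"]) if offers else None


table = extract("shop_a", FIG_9B) + extract("shop_b", FIG_10B)
best_pc1 = cheapest(table, "PC1")
```

Because both sources end up in one table with a common price style, a condition such as "lowest price for PC1" can be evaluated across shops without regard to how each page was laid out.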
FIG.20

Page_B HTML

<!DOCTYPE HTML PUBLIC "-//W3C//DTD W3 HTML//EN">
<HTML>
<HEAD>
<TITLE>page_B</TITLE>
</HEAD>
<BODY>
<FORM action=http://www.page_b.co.jp/cgi-bin/search.cgi method=GET>
SHOPS TO FIND<BR><BR>
<TABLE>
<TR>
<TH>
SHOP NAME
</TH>
<TH>
AREA
</TH>
</TR>
<TR>
<TD>
<INPUT name=key size=30>
</TD>
<TD>
<SELECT name=area>
<OPTION value=00 SELECTED>
<OPTION value=01>YOKOSUKA-SHI
<OPTION value=02>FUJISAWA-SHI
<OPTION value=03>HIRATSUKA-SHI
<OPTION value=04>ATSUGI-SHI
<OPTION value=05>ZUSHI-SHI
<OPTION value=06>SAGAMIHARA-SHI
<OPTION value=07>YOKOHAMA-SHI
<OPTION value=08>CHIGASAKI-SHI
</SELECT>
</TD>
</TR>
</TABLE><BR>
<INPUT type=submit value=SEARCH>
<INPUT type=reset value=CLEAR>
</FORM>
</BODY>
</HTML>
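
For a search engine whose input form looks like the Page_B form of FIG.20, forming a query amounts to converting the user's values into the form's own field names and option codes and appending them to the action URL. The Python sketch below is an illustrative assumption of how that could be done with the standard library. The action URL, the field names `key` and `area`, and the option codes come from the figure, while the function name and the handling of a missing area are hypothetical.

```python
from urllib.parse import urlencode

ACTION = "http://www.page_b.co.jp/cgi-bin/search.cgi"

# Style conversion data for the AREA item: user-visible name -> option code.
AREA_CODES = {
    "YOKOSUKA-SHI": "01",
    "FUJISAWA-SHI": "02",
    "HIRATSUKA-SHI": "03",
    "ATSUGI-SHI": "04",
    "ZUSHI-SHI": "05",
    "SAGAMIHARA-SHI": "06",
    "YOKOHAMA-SHI": "07",
    "CHIGASAKI-SHI": "08",
}


def form_query(shop_name, area=None):
    """Build the GET URL that submitting the Page_B form would request.

    When no area is given, the blank default option (value 00) is sent.
    An unknown area name raises KeyError rather than being guessed.
    """
    code = "00" if area is None else AREA_CODES[area]
    return f"{ACTION}?{urlencode({'key': shop_name, 'area': code})}"


url = form_query("Shop A", "YOKOHAMA-SHI")
```

The resulting URL can then be handed to whatever component transmits queries and acquires the result page.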
<HTML>
<HEAD> 

<TITLE>[illegible]</TITLE>

</HEAD>

<BODY BGCOLOR="#ffffff">
<P>

<DIV ALIGN=CENTER><FONT SIZE=5><B>[illegible]
</B></FONT><BR></DIV>

<TABLE BORDER>

<TR>

<TH ALIGN=CENTER BGCOLOR="#80ffff" NOWRAP>[illegible]</TH>
<TH ALIGN=CENTER BGCOLOR="#80ffff" NOWRAP>[illegible]</TH>
<TH ALIGN=CENTER BGCOLOR="#80ffff" NOWRAP>[illegible]</TH>
<TH ALIGN=CENTER BGCOLOR="#80ffff" NOWRAP>[illegible]</TH>
<TH ALIGN=CENTER BGCOLOR="#80ffff" NOWRAP>[illegible]</TH>
<TH ALIGN=CENTER BGCOLOR="#80ffff" NOWRAP>[illegible]</TH>
<TH ALIGN=CENTER BGCOLOR="#80ffff" NOWRAP>[illegible]</TH>
<TH COLSPAN=2 ALIGN=CENTER BGCOLOR="#80ffff" NOWRAP>[illegible]